<h1>User Experience Analysis</h1>

In [0]:
%pip install nltk

In [0]:
import nltk

In [0]:
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, lit, when, udf, explode, lower, count, size, isnan, sum
from pyspark.sql.types import IntegerType, ArrayType, StringType
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
from scipy.stats import shapiro, normaltest

In [0]:
spark = SparkSession.builder.appName("analysis_painpoint").getOrCreate()

In [0]:
reddit_url_to_exclude = [
    "https://www.reddit.com/r/korea/comments/1it9gty/exclusive_being_taken_prisoner_is_treason_in/",
    "https://www.reddit.com/r/korea/comments/11a53p7/colonial_police_warned_residents_about_police/",
    "https://www.reddit.com/r/korea/comments/13wkuv7/this_is_how_my_ukrainian_neighbor_responded_to/",
    "https://www.reddit.com/r/korea/comments/w6phlt/colonial_authorities_discussed_how_to_reduce/",
    "https://www.reddit.com/r/korea/comments/15xk99n/autopsy_identifies_strangulation_as_preliminary/",
    "https://www.reddit.com/r/korea/comments/6afzls/my_complicated_visa_issue_with_seemingly_no/",
]

In [0]:
appstore_df = spark.table("workspace.growth_poc.silver_appstore_reviews") \
                    .filter(year("updated") >= 2023)\
                   .select(
                       col("updated").alias("review_date"),
                       col("rating").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("country"),
                       col("language")
                   ) 
playstore_df = spark.table("workspace.growth_poc.silver_playstore_reviews") \
                    .filter(year("at") >= 2023)\
                    .select(
                       col("at").alias("review_date"),
                       col("score").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("language")
                   ) 
reddit_df = spark.table("workspace.growth_poc.silver_reddit_reviews")\
                    .filter((year("created_datetime") >= 2023) & ~(col("url").isin(reddit_url_to_exclude))) \
                    .select(
                       col("created_datetime").alias("review_date"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("score").alias("review_thumbsUpCount"),
                       col("language")
                   ) 
                    
review_contents_df = appstore_df.unionByName(playstore_df,allowMissingColumns=True) \
                                .unionByName(reddit_df,allowMissingColumns=True)


In [0]:
# register a temp view so the unified reviews can be queried with Spark SQL in later steps (this does not cache the data)
review_contents_df.createOrReplaceTempView("review_contents_temp_view")
review_contents_df = spark.sql("SELECT * FROM review_contents_temp_view")

In [0]:
print(appstore_df.count())
print(playstore_df.count())
print(reddit_df.count())

<h2>Check Data</h2>

In [0]:
# 1. check schema
review_contents_df.printSchema()
print()
# 2. check column names
print(review_contents_df.columns)
print()
# 3. check data types
print(review_contents_df.dtypes)
print()
# 4. check number of rows
print(review_contents_df.count())
print()
# 5. check summary statistics
review_contents_df.describe().show()

# 6. check missing values
# get numeric columns
numeric_cols = [name for name, dtype in review_contents_df.dtypes if dtype in ('double', 'float', 'int', 'bigint')]
# numeric columns: count null values
for c in numeric_cols:
    null_count = review_contents_df.select(count(when(col(c).isNull() | isnan(c), c))).collect()[0][0]
    print(f"{c}: {null_count} nulls")



## 1. Word Frequency Analysis
To find the **common keywords** that repeatedly appear in negative reviews and the **biggest pain points** users experience, I conducted a simple frequency analysis and a weighted frequency analysis.

<h3>1-1. Simple Frequency Analysis</h3>

<h4>1-1-1. Single Keyword Frequency Analysis </h4>

In [0]:
keyword_analysis_df = review_contents_df.select("review_rate", "review_content", "review_words", "review_thumbsUpCount")

In [0]:
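def detect_negative_review(sentence):
    """Return 1 for negative, -1 for positive, and 0 for neutral or empty text (VADER compound score)."""
    if not sentence:  # None or empty string
        return 0
    # `analyzer` is the SentimentIntensityAnalyzer created in the next cell
    score = analyzer.polarity_scores(sentence)["compound"]
    if score < 0:
        return 1
    elif score > 0:
        return -1
    else:
        return 0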

In [0]:
# Use VADER for sentiment analysis
analyzer = SentimentIntensityAnalyzer()

# Mark negative reviews
# If review_rate is 2 or lower, mark as negative (1); above 3, positive (-1); exactly 3, neutral (0)
# If review_rate is not available, fall back to VADER
detect_negative_review_udf = udf(detect_negative_review, IntegerType())
mark_negative_reviews_df = keyword_analysis_df.withColumn("is_negative", \
    when(col("review_rate").isNull(), detect_negative_review_udf(col("review_content")))\
    .when(col("review_rate") <= 2, 1)\
    .when(col("review_rate") > 3, -1)\
    .otherwise(0)   
)

# get only negative reviews
negative_df = mark_negative_reviews_df.filter((col("is_negative") == 1))

# flatten words
words_exploded = negative_df.select(explode(col("review_words")).alias("word"), col("review_thumbsUpCount"))

# set lowercase and count 
word_counts = words_exploded.withColumn("word", lower(col("word"))) \
                            .groupBy("word").count() \
                            .orderBy(col("count").desc()).limit(100)

word_counts.show(100, truncate= False)


| word      | count |
|-----------|-------|
| delivery  | 161   |
| order     | 153   |
| food      | 147   |
| korea     | 139   |
| korean    | 134   |
| get       | 132   |
| app       | 130   |
| time      | 128   |
| card      | 101   |
| one       | 91    |
| even      | 90    |
| like      | 87    |
| use       | 80    |
| number    | 80    |
| thing     | 72    |
| phone     | 71    |
| go        | 68    |
| restaurant| 67    |
| work      | 61    |
| foreigner | 60    |
| service   | 60    |
| also      | 59    |
| back      | 57    |
| make      | 57    |
| need      | 56    |
| know      | 56    |
| people    | 54    |
| place     | 54    |
| problem   | 53    |
| pay       | 52    |
| country   | 52    |
| review    | 51    |
| item      | 51    |
| driver    | 49    |
| want      | 49    |
| really    | 48    |
| way       | 47    |
| day       | 47    |
| never     | 44    |
| every     | 44    |
| take      | 44    |
| account   | 44    |
| still     | 41    |
| something | 40    |
| got       | 40    |
| think     | 39    |
| come      | 38    |
| door      | 37    |
| money     | 37    |
| going     | 37    |
| bad       | 36    |
| lot       | 35    |
| used      | 35    |
| without   | 34    |
| good      | 34    |
| id        | 33    |
| coupang   | 33    |
| try       | 33    |
| english   | 32    |
| apps      | 32    |
| much      | 32    |
| option    | 31    |
| year      | 31    |
| call      | 31    |
| ordered   | 31    |
| live      | 31    |
| customer  | 30    |
| see       | 30    |
| bank      | 30    |
| issue     | 30    |
| many      | 29    |
| say       | 29    |
| u         | 29    |
| worst     | 28    |
| ever      | 28    |
| payment   | 28    |
| said      | 28    |
| using     | 27    |
| since     | 27    |
| tourist   | 27    |
| foreign   | 27    |
| sure      | 27    |
| hour      | 26    |
| first     | 26    |
| may       | 26    |
| delivered | 25    |
| able      | 25    |
| always    | 25    |
| could     | 25    |
| someone   | 24    |
| leave     | 24    |
| new       | 24    |
| tip       | 24    |
| tried     | 23    |
| wont      | 23    |
| living    | 23    |
| find      | 23    |
| wrong     | 23    |
| minute    | 23    |
| hard      | 23    |


In [0]:
print(negative_df.count())

I extracted the 100 most frequent words from the 520 negative reviews and grouped them into three key problem areas, plus one point that needs further attention.<br/>
<ol>
<li><b>Foreigner-Specific Issues</b></li>
korean(134), english(32), foreigner(60), foreign(27), call(31), tourist(27)<br/>
This suggests that foreigners are likely to face language-related or systemic barriers when using the apps.

<li><b>Delivery Service Quality</b></li>
time(128), service(60), restaurant(67), door(37), item(51), driver(49), option(31), customer(30)<br/>
This points to problems in the delivery experience itself, such as delivery time, restaurant service, and communication with drivers.

<li><b>Payment & Verification</b></li>
card(101), use(80), number(80), phone(71), pay(52), account(44), money(37), id(33), bank(30), payment(28)<br/>
This indicates that users struggle to complete orders because of payment or verification problems. I assume these problems are related to issues such as "foreign card" or "phone verification".

<li><b>Special Attention</b></li>
The specific name "coupang" (33) stands out. N-gram analysis can determine whether it refers to the Coupang company itself or to the Coupang Eats app.
</ol>
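
The grouping above can be reproduced with a small tally. The following is a minimal sketch in plain Python, assuming the counts from the table have been copied by hand into a dictionary. The `categories` mapping and the `category_totals` helper are hypothetical names introduced here for illustration and are not part of the notebook's pipeline. An explicit loop is used instead of the built-in `sum`, because `sum` is rebound to `pyspark.sql.functions.sum` by the imports in this notebook.

```python
# Counts copied from the top-100 table above (only the categorized words).
word_counts = {
    "korean": 134, "english": 32, "foreigner": 60, "foreign": 27, "call": 31, "tourist": 27,
    "time": 128, "service": 60, "restaurant": 67, "door": 37, "item": 51, "driver": 49,
    "option": 31, "customer": 30,
    "card": 101, "use": 80, "number": 80, "phone": 71, "pay": 52, "account": 44,
    "money": 37, "id": 33, "bank": 30, "payment": 28,
}

# Hypothetical mapping that mirrors the manual categorization in the list above.
categories = {
    "Foreigner-Specific Issues": ["korean", "english", "foreigner", "foreign", "call", "tourist"],
    "Delivery Service Quality": ["time", "service", "restaurant", "door", "item", "driver",
                                 "option", "customer"],
    "Payment & Verification": ["card", "use", "number", "phone", "pay", "account",
                               "money", "id", "bank", "payment"],
}

def category_totals(counts, mapping):
    """Add up the word counts for each category; words missing from counts add 0."""
    totals = {}
    for name, words in mapping.items():
        total = 0
        for word in words:
            total += counts.get(word, 0)
        totals[name] = total
    return totals

totals = category_totals(word_counts, categories)
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {total}")
```

These totals count keyword mentions, not reviews, so a single review that uses several of the words contributes to a category more than once.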


<h4>1-1-2. N-gram Analysis</h4>

In [0]:
# take a list of words and return a list of bigram strings (two adjacent words joined by a space)
def create_bigrams_from_list(words):
    if not words or len(words) < 2:
        return []
    # zip stops when the shorter list is exhausted
    # words      = ['I',    'EAT',    'BANANA']
    # words[1:]  = ['EAT',  'BANANA']
    # => [('I', 'EAT'), ('EAT', 'BANANA')]
    bigrams_list = list(zip(words, words[1:]))

    bigrams = [" ".join(grams) for grams in bigrams_list]
    return bigrams

# take a list of words and return a list of trigram strings (three adjacent words joined by a space)
def create_trigrams_from_list(words):
    if not words or len(words) < 3:
        return []
    trigrams_list = list(zip(words, words[1:], words[2:]))

    trigrams = [" ".join(grams) for grams in trigrams_list]
    return trigrams

# register the functions as UDFs
create_bigrams_udf = udf(create_bigrams_from_list, ArrayType(StringType()))
create_trigrams_udf = udf(create_trigrams_from_list, ArrayType(StringType()))

In [0]:
# get bigram result
bigrams_df = negative_df.withColumn("keywords_paired", 
                                    create_bigrams_udf(col("review_words"))) \
                        .select("keywords_paired", "review_thumbsUpCount")

In [0]:
# flatten keywords and aggregate (count)
bigrams_flat_df = bigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

bigrams_flat_df.show(50,truncate = False)

**Bigram Analysis Results**
| keywords        | count |
|-----------------|-------|
| phone number    | 43    |
| food delivery   | 24    |
| customer service| 19    |
| credit card     | 18    |
| korean phone    | 18    |
| delivery driver | 13    |
| delivery service| 11    |
| order delivery  | 11    |
| food delivered  | 11    |
| coupang eats    | 11    |
| use app         | 10    |
| delivery apps   | 9     |
| cancel order    | 9     |
| foreign card    | 9     |
| first time      | 9     |
| bank account    | 9     |
| apple pay       | 8     |
| delivery app    | 8     |
| korean bank     | 8     |
| feel like       | 8     |
| gon na          | 8     |
| app ever        | 7     |
| even though     | 7     |
| every time      | 7     |
| bank card       | 7     |
| every country   | 7     |
| money back      | 6     |
| uber eats       | 6     |
| order food      | 6     |
| worst app       | 6     |
| go back         | 6     |
| front door      | 6     |
| delivery guy    | 6     |
| delivery time   | 6     |
| need korean     | 6     |
| place live      | 6     |
| thing like      | 5     |
| make sure       | 5     |
| get money       | 5     |
| negative review | 5     |
| app even        | 5     |
| new one         | 5     |
| payment card    | 5     |
| order something | 5     |
| without korean  | 5     |
| thing korea     | 5     |
| waste time      | 5     |
| hard time       | 5     |
| able use        | 5     |
| tmoney card     | 5     |




In [0]:
# get trigram result
trigrams_df = negative_df.withColumn("keywords_paired", 
                                    create_trigrams_udf(col("review_words"))) \
                         .select("keywords_paired", "review_thumbsUpCount")

In [0]:
trigrams_flat_df = trigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

trigrams_flat_df.show(20, truncate = False)

**Trigram Analysis Result**
| keywords                | count |
|--------------------------|-------|
| korean phone number      | 17    |
| worst app ever           | 5     |
| need phone number        | 5     |
| food delivery service    | 4     |
| foreign credit card      | 4     |
| need korean phone        | 4     |
| get money back           | 3     |
| korean bank account      | 3     |
| alien registration card  | 3     |
| arc alien registration   | 3     |
| english eye english      | 2     |
| phone number set         | 2     |
| use non korean           | 2     |
| without korea phone      | 2     |
| foreign card work        | 2     |
| support apple pay        | 2     |
| credit card accepted     | 2     |
| contact customer service | 2     |
| eye english eye          | 2     |
| food discarded even      | 2     |




By applying N-gram analysis, I was able to better capture the context of word usage, which single keyword analysis alone could not fully reveal.

<ol>
<li><b>Dominant Issue: Foreigner Verification & Payment</b></li>
Key Words:
<ul>
<li>Bigram: phone number (43), credit card (18), korean phone (18), bank account (9), korean bank (8), apple pay (8), foreign card (9), payment card (5), without korean (5), tmoney card (5)</li>
<li>Trigram: korean phone number (17), need phone number (5), foreign credit card (4), need korean phone (4), korean bank account (3), alien registration card (3), arc alien registration (3), phone number set (2), without korea phone (2), foreign card work (2), support apple pay (2), credit card accepted (2)</li>
</ul>
Quantitative evidence shows that the biggest difficulty for users is the verification process, which requires a Korean phone number. Payment failures caused by unsupported foreign credit cards or by the lack of a Korean bank account also emerge as a clear issue.

<li><b>Service Quality Issues</b></li>
Key Words:
<ul>
<li>Bigram: customer service (19), delivery driver (13), delivery service (11), food delivery (24), order delivery (11), food delivered (11), delivery apps (9), cancel order (9), delivery app (8), delivery guy (6), delivery time (6)</li>
<li>Trigram: worst app ever (5), food delivery service (4), get money back (3), contact customer service (2), food discarded even (2)</li>
</ul>
Even after completing verification and payment, users frequently express dissatisfaction with service quality, including customer service, delivery drivers, and delivery times.

<li><b>Foreigner-Specific Issues</b></li>
Key Words:
<ul>
<li>Bigram: need korean (6), without korean (5), able use (5)</li>
<li>Trigram: use non korean (2), without korea phone (2)</li>
</ul>
Foreign users experience inconvenience not only from language barriers but also from structural requirements such as needing a Korean phone number or bank account. N-gram analysis shows that the term "Korean" is associated with these requirements more often than with language support.

<li><b>Mentions of Specific Apps</b></li>
Key Words:
<ul>
<li>Bigram: coupang eats (11), uber eats (6)</li>
<li>Trigram: worst app ever (5)</li>
</ul>
Through bigram analysis, the word “Coupang,” which appeared in single keyword analysis, is revealed to specifically refer to the food delivery app <b>Coupang Eats</b>. Additionally, <b>Uber Eats</b>, which is widely used internationally, also appears. The frequent mentions of specific apps indicate their high market visibility and their role as benchmarks for user expectations.
</ol>
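
The bigram and trigram helpers used in this section follow the same pattern, so they can be generalized to any n. The following is a minimal sketch in plain Python (no Spark), where `make_ngrams` and the sample `reviews` are hypothetical names and data introduced only for illustration.

```python
from collections import Counter

def make_ngrams(words, n):
    """Return every run of n consecutive words, each joined by a space."""
    if not words or len(words) < n:
        return []
    # Build n shifted views of the list; zip stops at the shortest view.
    shifted = [words[i:] for i in range(n)]
    return [" ".join(gram) for gram in zip(*shifted)]

# Hypothetical tokenized reviews, not taken from the dataset.
reviews = [
    ["need", "korean", "phone", "number"],
    ["korean", "phone", "number", "required"],
]

trigram_counts = Counter()
for words in reviews:
    trigram_counts.update(make_ngrams(words, 3))

print(trigram_counts.most_common(1))  # [('korean phone number', 2)]
```

With `n=2` and `n=3` the helper returns the same strings as `create_bigrams_from_list` and `create_trigrams_from_list`, so it could be wrapped in a UDF in the same way if longer n-grams were needed.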


### 1-2. Weighted Frequency Analysis
#### 1-2-1. Single Keyword Frequency Analysis

In [0]:
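# weight each word occurrence by (thumbs-up count + 1) so that reviews with zero thumbs-up still count once
# note: rows whose review_thumbsUpCount is null yield a null weight, which sum() skips
weighted_word_counts = words_exploded.withColumn("word", lower(col("word"))) \
    .groupBy("word") \
    .agg(
        sum(col("review_thumbsUpCount") + 1).alias("weighted_count")) \
    .orderBy(col("weighted_count").desc()) \
    .limit(100)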


In [0]:
weighted_word_counts.show(100, truncate = False)

| word        | weighted_count |
|-------------|----------------|
| delivery    | 2201           |
| food        | 1797           |
| order       | 1506           |
| korea       | 1484           |
| korean      | 1474           |
| arc         | 1338           |
| phone       | 1277           |
| get         | 1240           |
| card        | 1200           |
| number      | 1198           |
| even        | 1118           |
| like        | 1062           |
| apps        | 1036           |
| app         | 1017           |
| time        | 929            |
| account     | 928            |
| restaurant  | 925            |
| one         | 854            |
| foreigner   | 813            |
| people      | 808            |
| thing       | 802            |
| review      | 786            |
| need        | 782            |
| back        | 763            |
| country     | 695            |
| go          | 668            |
| place       | 614            |
| use         | 602            |
| item        | 589            |
| way         | 588            |
| got         | 572            |
| know        | 562            |
| pay         | 556            |
| bank        | 555            |
| make        | 538            |
| money       | 510            |
| also        | 509            |
| find        | 504            |
| bad         | 499            |
| really      | 495            |
| service     | 474            |
| every       | 470            |
| work        | 466            |
| want        | 465            |
| connected   | 442            |
| still       | 436            |
| deal        | 429            |
| price       | 425            |
| shuttle     | 412            |
| sub         | 403            |
| registration| 400            |
| business    | 396            |
| register    | 389            |
| driver      | 389            |
| wanted      | 366            |
| tried       | 363            |
| visa        | 363            |
| city        | 362            |
| good        | 361            |
| visit       | 360            |
| without     | 357            |
| problem     | 354            |
| alien       | 350            |
| everything  | 348            |
| think       | 347            |
| day         | 346            |
| identity    | 343            |
| verify      | 339            |
| longterm    | 336            |
| wont        | 335            |
| week        | 335            |
| call        | 330            |
| thought     | 328            |
| come        | 317            |
| let         | 315            |
| many        | 314            |
| used        | 313            |
| isnt        | 310            |
| much        | 310            |
| plastic     | 307            |
| usually     | 306            |
| tip         | 305            |
| able        | 303            |
| great       | 300            |
| another     | 295            |
| spend       | 294            |
| see         | 290            |
| system      | 290            |
| tourist     | 284            |
| feel        | 284            |
| though      | 282            |
| might       | 282            |
| never       | 279            |
| may         | 278            |
| going       | 277            |
| two         | 275            |
| person      | 274            |
| first       | 274            |
| living      | 274            |
| big         | 274            |



#### 1-2-2. Bigram Analysis

In [0]:
weighted_bigrams_flat_df = bigrams_df.withColumn("keywords", explode(col("keywords_paired")))\
                            .groupBy("keywords")\
                            .agg(
                                sum(col("review_thumbsUpCount") + 1).alias("weighted_count")) \
                            .orderBy(col("weighted_count").desc()).limit(100)


In [0]:
weighted_bigrams_flat_df.show(20, truncate = False)

| keywords          | weighted_count |
|------------------|----------------|
| phone number      | 906           |
| food delivery     | 509           |
| need phone        | 348           |
| visit sub         | 344           |
| alien registration| 343           |
| arc alien         | 343           |
| registration card | 343           |
| number connected  | 340           |
| connected arc     | 340           |
| visa arc          | 338           |
| longterm visa     | 336           |
| apps longterm     | 336           |
| identity apps     | 336           |
| arc arc           | 336           |
| register verify   | 336           |
| arc food          | 336           |
| order register    | 336           |
| verify identity   | 336           |
| deal need         | 336           |
| delivery account  | 336           |



#### 1-2-3. Trigram Analysis

In [0]:

weighted_trigrams_flat_df = trigrams_df.withColumn("keywords", explode(col("keywords_paired")))\
                            .groupBy("keywords")\
                            .agg(
                                sum(col("review_thumbsUpCount") + 1).alias("weighted_count")) \
                            .orderBy(col("weighted_count").desc()).limit(100)

In [0]:
weighted_trigrams_flat_df.show(20, truncate = False)

| keywords                | weighted_count |
|-------------------------|----------------|
| need phone number       | 348           |
| alien registration card | 343           |
| arc alien registration  | 343           |
| connected arc alien     | 340           |
| number connected arc    | 340           |
| phone number connected  | 340           |
| arc arc food            | 336           |
| identity apps longterm  | 336           |
| deal need phone         | 336           |
| order register verify   | 336           |
| longterm visa arc       | 336           |
| verify identity apps    | 336           |
| registration card order | 336           |
| food delivery account   | 336           |
| visa arc arc            | 336           |
| card order register     | 336           |
| arc food delivery       | 336           |
| register verify identity| 336           |
| apps longterm visa      | 336           |
| korean phone number     | 299           |



By weighting keywords with thumbs-up counts, we were able to identify which issues resonate most strongly with users, beyond the raw frequency of mentions.

<ol>
<li><b>Dominant Issue: Verification & Payment</b></li>
Key Words:
<ul>
<li>N-grams: phone number (906), need phone (348), alien registration (343), registration card (343), number connected (340), visa arc (338), longterm visa (336), register verify (336), verify identity (336), delivery account (336), need phone number (348), alien registration card (343), arc alien registration (343), connected arc alien (340), number connected arc (340), phone number connected (340), order register verify (336), longterm visa arc (336), verify identity apps (336), register verify identity (336), korean phone number (299)</li>
<li>Single Keywords: korean (1474), arc (1338), phone (1277), card (1200), number (1198), account (928), bank (555), money (510), registration (400), identity (343), verify (339)</li>
</ul>
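Quantitative evidence shows that verification and payment issues are the top concerns. In particular, problems related to the ARC (Alien Registration Card), identity verification, and Korean bank accounts or cards are the ones that resonate most strongly with users. Note, however, that many of these n-grams share an identical weight of 336, which suggests that they come from a single highly upvoted review, so the ranking partly reflects that one review rather than many independent mentions.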

<li><b>Service Usage & Ordering Experience</b></li>
Key Words:
<ul>
<li>Single Keywords: time (929), item (589), service (474), shuttle (412), sub (403), driver (389)</li>
</ul>
Unlike simple frequency analysis, issues related to service usage and the ordering process do not appear as top concerns in the weighted analysis.

<li><b>Foreigner-Specific Issues</b></li>
Key Words:
<ul>
<li>Single Keywords: korean (1474), foreigner (813), tourist (284)</li>
</ul>
Issues specific to foreign users also do not carry as much weight when thumbs-up counts are applied. Interpretation may vary depending on whether "Korean" is read as language support or as referring to Korean-specific systems/items.

<li><b>Notable Points</b></li>
Key Words:
<ul>
<li>Single Keywords: shuttle (412), alternative left shuttle (200), plastic (307)</li>
</ul>
Unlike in the simple frequency analysis, mentions of Coupang Eats do not appear. Instead, Shuttle emerges as an alternative delivery solution. The appearance of "plastic" likely reflects user concern about plastic waste generated through food delivery apps.
</ol>
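
To make the effect of the weighting concrete, the following minimal sketch compares simple and weighted counts in plain Python. It uses the same weight as the cells above (thumbs-up count + 1); the sample `reviews` are hypothetical and were chosen only to show how one highly upvoted review can outrank words that appear in more reviews.

```python
from collections import defaultdict

# Hypothetical reviews as (tokenized words, thumbs-up count); not taken from the dataset.
reviews = [
    (["need", "arc", "verify"], 335),
    (["late", "delivery"], 0),
    (["late", "delivery", "again"], 2),
]

simple = defaultdict(int)
weighted = defaultdict(int)
for words, thumbs_up in reviews:
    for word in words:
        simple[word] += 1
        # thumbs-up count + 1, so a review with zero thumbs-up still counts once
        weighted[word] += thumbs_up + 1

top_simple = max(simple, key=simple.get)
top_weighted = max(weighted, key=weighted.get)
print(top_simple, simple[top_simple])        # late 2
print(top_weighted, weighted[top_weighted])  # need 336
```

In this toy data, "late" and "delivery" appear in two reviews each, yet every word from the single review with 335 thumbs-up receives a weight of 336. The weighted ranking therefore measures endorsement of individual reviews as much as the breadth of an issue.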


## 2. Sentiment Analysis
Now, I will conduct sentiment analysis to find:
<ol>
<li> Whether the ratings given by users match the sentiment of the review text
<li> The current distribution of positive, neutral, and negative reviews
</ol> 

### 2-1. User rating vs. text sentiment

In [0]:
# get reviews for which the user has given a rating
sql = """
SELECT review_rate,  
    review_content,
    review_sentences
FROM review_contents_temp_view 
WHERE review_rate IS NOT NULL AND YEAR(review_date) >= 2023
"""
review_rate_analysis_df = spark.sql(sql)



In [0]:
# 1. check schema
review_rate_analysis_df.printSchema()
print()
# 2. check column names
print(review_rate_analysis_df.columns)
print()
# 3. check data types
print(review_rate_analysis_df.dtypes)
print()
# 4. check number of rows
print(review_rate_analysis_df.count())
print()
# 5. check summary statistics
review_rate_analysis_df.describe().show()

# 6. check missing values
# get numeric columns
numeric_cols = [name for name, dtype in review_rate_analysis_df.dtypes if dtype in ('double', 'float', 'int', 'bigint')]
# numeric columns: count null values
for c in numeric_cols:
    null_count = review_rate_analysis_df.select(count(when(col(c).isNull() | isnan(c), c))).collect()[0][0]
    print(f"{c}: {null_count} nulls")



In [0]:
analyzer = SentimentIntensityAnalyzer()

In [0]:
def detect_text_sentiment(sentences):
    """Classify text as "positive", "negative", or "neutral" from the mean VADER compound score."""
    # if empty, treat as neutral (the UDF is declared as StringType, so always return a label)
    if not sentences:
        return "neutral"
    # if a single string is passed, wrap it in a list
    if isinstance(sentences, str):
        sentences = [sentences]

    total_score = 0
    for sentence in sentences:
        score = analyzer.polarity_scores(sentence)["compound"]
        total_score += score

    # get average of the sentiment score
    avg_score = total_score / len(sentences)

    if avg_score < 0:
        return "negative"
    elif avg_score > 0:
        return "positive"
    else:
        return "neutral"

detect_text_sentiment_udf = udf(detect_text_sentiment, StringType())

In [0]:
review_rate_analysis_df = review_rate_analysis_df.withColumn("classification_by_user_rate", 
                                                             when(col("review_rate") > 3, "positive")\
                                                            .when(col("review_rate") < 3, "negative")\
                                                            .otherwise("neutral"))

In [0]:
rate_detected_df = review_rate_analysis_df.withColumn("classification_by_model", detect_text_sentiment_udf(col("review_sentences")))

In [0]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# collect both label columns in a single pass so that true and predicted labels stay row-aligned
rows = rate_detected_df.select('classification_by_user_rate', 'classification_by_model').collect()
y_true = [row['classification_by_user_rate'] for row in rows]
y_pred = [row['classification_by_model'] for row in rows]

# fix the label order so the matrix rows/columns match the tick labels
labels = sorted(set(y_true + y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# plot visualization
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)

plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.savefig("./confusion_matrix.png")
plt.show()

![](./images/confusion_matrix.png)

**Findings** <br>
The VADER model performs well at detecting positive reviews and reasonably well at detecting negative reviews, but it is not reliable for neutral reviews. We should therefore keep in mind that the negative reviews we analyze may not capture every negative review written by users.
As in the word frequency analysis, I will continue to prioritize user-provided ratings when classifying reviews as positive, neutral, or negative, and use the VADER model only when a rating is not available.
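
The "rating first, model as fallback" rule can be written as a small function. The following is a minimal sketch in plain Python; `classify_by_rating`, `classify_review`, and `toy_text_classifier` are hypothetical names, and the toy classifier is only a stand-in for the VADER-based UDF, not a real sentiment model.

```python
def classify_by_rating(rating):
    """Map a 1-5 star rating to a sentiment label (same thresholds as the notebook)."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return "neutral"

def classify_review(rating, text, text_classifier):
    """Prefer the user's rating; use the text classifier only when no rating exists."""
    if rating is not None:
        return classify_by_rating(rating)
    return text_classifier(text)

def toy_text_classifier(text):
    """Hypothetical stand-in for the VADER-based classifier; keyword check only."""
    if not text:
        return "neutral"
    return "negative" if "worst" in text.lower() else "neutral"

print(classify_review(5, "worst app ever", toy_text_classifier))     # positive (rating wins)
print(classify_review(None, "worst app ever", toy_text_classifier))  # negative (fallback)
print(classify_review(3, "it is okay", toy_text_classifier))         # neutral
```

Because the rating always takes precedence, the model's weakness on neutral text only affects reviews that have no rating, which in this dataset are the Reddit posts.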


### 2-2. Current distribution of positive, neutral, and negative reviews

In [0]:
sentiment_marked_df = review_contents_df.withColumn("review_sentiment", \
        when(col("review_rate").isNull(), detect_text_sentiment_udf(col("review_sentences")))\
        .when(col("review_rate") < 3, "negative")\
        .when(col("review_rate") > 3, "positive")\
        .otherwise("neutral")   
    )
sentiment_df = sentiment_marked_df.groupBy("review_sentiment").count().toPandas()

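labels = sentiment_df["review_sentiment"]
sizes = sentiment_df["count"]

# map colors by label, because the row order returned by groupBy is not guaranteed
color_map = {"negative": "lightcoral", "neutral": "lightblue", "positive": "lightgreen"}
colors = [color_map.get(label, "lightgrey") for label in labels]

plt.figure(figsize=(3, 3))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)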
plt.title('Distribution of Reviews')
plt.axis('equal')
plt.show()

![](./images/pie_distribution.png)

In [0]:
# weight each review by (thumbs-up count + 1), as in the weighted word frequency analysis
weighted_sentiment_df = sentiment_marked_df.groupBy("review_sentiment")\
                                  .agg(
                                      sum(col("review_thumbsUpCount") + 1).alias("count"))\
                                  .toPandas()

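labels = weighted_sentiment_df["review_sentiment"]
sizes = weighted_sentiment_df["count"]

# map colors by label, because the row order returned by groupBy is not guaranteed
color_map = {"negative": "lightcoral", "neutral": "lightblue", "positive": "lightgreen"}
colors = [color_map.get(label, "lightgrey") for label in labels]

plt.figure(figsize=(3, 3))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)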
plt.title('Weighted Distribution of Reviews')
plt.axis('equal')
plt.show()

![](./images/pie_weighted_distribution.png)

**Findings**  
I analyzed the distribution of reviews by sentiment. Positive reviews account for approximately 64-66% of all reviews and negative reviews for about 19-20%; the remainder are neutral. The simple counts and the thumbs-up-weighted counts show no notable difference, which indicates that all types of reviews receive roughly similar engagement from users.  

These results suggest that most users are satisfied with the apps. However, attention should be focused on improving the experience of users who leave negative or neutral reviews, as addressing their concerns could further enhance overall user satisfaction and engagement.
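
The comparison between the simple and weighted distributions comes down to converting counts into percentage shares and checking the gap. The following is a minimal sketch in plain Python; `percentage_shares` is a hypothetical helper, and the counts are hypothetical values chosen for illustration, not the notebook's actual results.

```python
def percentage_shares(counts):
    """Convert raw counts per sentiment label into percentage shares (one decimal)."""
    total = 0
    for value in counts.values():
        total += value
    return {label: round(100 * value / total, 1) for label, value in counts.items()}

# Hypothetical counts, for illustration only.
simple_counts = {"positive": 650, "neutral": 150, "negative": 200}
weighted_counts = {"positive": 1320, "neutral": 280, "negative": 400}

simple_shares = percentage_shares(simple_counts)
weighted_shares = percentage_shares(weighted_counts)

# Largest gap, in percentage points, between the two distributions.
max_gap = max(abs(simple_shares[k] - weighted_shares[k]) for k in simple_shares)

print(simple_shares)    # {'positive': 65.0, 'neutral': 15.0, 'negative': 20.0}
print(weighted_shares)  # {'positive': 66.0, 'neutral': 14.0, 'negative': 20.0}
print(max_gap)          # 1.0
```

A gap of about one percentage point, as in this toy example, is the kind of difference described above as not notable.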


## 3. Time-Series Analysis
Now, I will look into the trend of non-positive reviews over time by:
<ol>
<li> Calculating the non-positive review ratio over time
<li> Plotting the trend line and reference baselines
</ol> 

### 3-1. Review trend over time
I will plot line graphs to visualize the trend of non-positive reviews over time. Since we observed no notable difference between simple and weighted counts in the overall review distribution, only simple counts will be used in this graph.

In [0]:
from pyspark.sql.functions import to_date
sentiment_date_df = sentiment_marked_df.withColumn("review_date", to_date(col("review_date")))

In [0]:
sentiment_date_agg_df = sentiment_date_df.groupBy("review_date", "review_sentiment").agg(
    count("*").alias("count"),
    sum(col("review_thumbsUpCount") + 1).alias("weighted_count")
)

In [0]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pyspark.sql.functions import trunc

# now the date is at day level -> aggregate to year-month level
# trunc(date, "month"): returns the first day of the given month
monthly_df = sentiment_date_agg_df.withColumn("year_month", trunc("review_date", "month"))

# grouping
sentiment_trend_df = monthly_df.groupBy("year_month") \
    .pivot("review_sentiment", ["positive", "negative", "neutral"]) \
    .agg(sum("count")).na.fill(0)\
    .orderBy("year_month")

pd_df = sentiment_trend_df.toPandas()

# create graph
fig, ax = plt.subplots(figsize= (15, 7))

# plot lines
# 1. positive review
ax.plot(pd_df['year_month'], 
        pd_df['positive'], 
        marker = 'o', 
        linestyle = '-', 
        label = 'Positive', 
        color = 'royalblue') # positive review
# 2. negative review
ax.plot(pd_df['year_month'], 
        pd_df['negative'], 
        marker = 'o', 
        linestyle = '-', 
        label = 'Negative', 
        color = 'tomato')  
# 3. neutral review
ax.plot(pd_df['year_month'], 
        pd_df['neutral'], 
        marker = 'o', 
        linestyle = '-', 
        label = 'Neutral', 
        color = 'grey') 
# 4. non-positive review
ax.plot(pd_df['year_month'], 
        pd_df['neutral']+pd_df['negative'], 
        marker = 'o', 
        linestyle = '--', 
        label = 'Non-positive', 
        color = 'black') 

ax.set_title('Monthly Review Sentiment Trend')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Reviews')
ax.legend()
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# set x axis date format
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.xticks(rotation=45)
plt.tight_layout()  

# 5. display the graph
plt.show()

**Findings** <br/>
The blue line shows positive reviews, the red line shows negative reviews, the gray line shows neutral reviews, and the black dashed line shows the number of non-positive reviews.
It is noticeable that there are certain times when the total number of reviews spikes. Given that positive (blue) and non-positive (black) reviews tend to move together, there may have been major events such as promotions or important feature updates triggering users to write reviews in April 2023, February 2024, May 2024, July 2024, and January 2025.
While this line graph effectively displays the trends of positive, negative, and neutral review counts over time, it is not the ideal way to observe the trend of non-positive reviews. Instead, I will add another line chart to show the ratio of non-positive reviews over time.

### 3-2. Non-positive review ratio trend over time
I would like to add a baseline to the graph so that the trend can be evaluated against it.  
To determine an appropriate baseline, I first need to check whether the data follows a normal distribution.  
If the data is approximately normal, I will set the baseline using Mean + Standard Deviation as the upper bound and Mean - Standard Deviation as the lower bound.  
If the data is not normal, I will instead use the third quartile (Q3) as the baseline.  
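
The decision rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `choose_baseline`, the `alpha=0.05` threshold, and the toy ratios are hypothetical, and only the Shapiro-Wilk test is used to decide, whereas the notebook also looks at skewness, kurtosis, and D'Agostino and Pearson's test.

```python
import numpy as np
from scipy.stats import shapiro

def choose_baseline(values, alpha=0.05):
    """Pick baseline bounds depending on a Shapiro-Wilk normality check."""
    values = np.asarray(values, dtype=float)
    _, p_value = shapiro(values)
    if p_value > alpha:
        # Normality not rejected: use mean +/- one sample standard deviation.
        mean, std = values.mean(), values.std(ddof=1)
        return {"method": "mean_std", "upper": mean + std, "lower": mean - std}
    # Normality rejected: fall back to the third quartile as a single baseline.
    return {"method": "q3", "upper": float(np.percentile(values, 75)), "lower": None}

# Hypothetical monthly ratios, for illustration only.
toy_ratios = [0.30, 0.32, 0.34, 0.35, 0.36, 0.38, 0.33, 0.31]
print(choose_baseline(toy_ratios))
```

Using `ddof=1` matches the default of the pandas `Series.std()` call, so the bounds agree with those computed on a pandas column.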


In [0]:
# add negative and neutral reviews
pd_df['non_positive'] = pd_df['negative'] + pd_df['neutral']
pd_df['all_reviews'] = pd_df['positive'] + pd_df['non_positive'] 
# if there is any review, calculate the ratio
# otherwise, populate np.nan (not 0) so empty months are excluded rather than counted as 0%
pd_df['non_positive_ratio'] = np.where(
    pd_df['all_reviews'] > 0, pd_df['non_positive']/pd_df['all_reviews'],
    np.nan
)

# drop rows with na in ratio
plot_df = pd_df.dropna(subset=['non_positive_ratio']).copy()

In [0]:
data = plot_df["non_positive_ratio"]

# Check Skewness, Kurtosis
print("Skewness:", data.skew())     
print("Kurtosis:", data.kurt())     

# Shapiro-Wilk Test
stat, p = shapiro(data)
print(f"Shapiro-Wilk Test: stat={round(stat,2)}, p={round(p,2)}")

# D’Agostino and Pearson’s Test (scipy.stats.normaltest)
stat, p = normaltest(data)
print(f"D’Agostino and Pearson’s Test: stat={round(stat,2)}, p={round(p,2)}")

Skewness: close to 0 => approximately symmetrical <br>
Kurtosis: slightly flatter than a normal distribution <br>
Shapiro-Wilk Test: p > 0.05 => normality is not rejected <br>
D'Agostino and Pearson's Test: p > 0.85, well above 0.05 => normality is not rejected <br>

Therefore, I will use Mean + Standard Deviation as upper bound and Mean - Standard Deviation as the lower bound.


In [0]:
# convert date to numeric value to calculate trend line (linear regression)
x_numeric = mdates.date2num(plot_df['year_month'])
y_values = plot_df['non_positive_ratio']

# find slope and intercept
slope, intercept = np.polyfit(x_numeric, y_values, deg = 1) # deg = 1 : linear
# using the slope and intercept, calculate reference line
trend_y = slope * x_numeric + intercept

# calculate mean and std to set base line
ratio_mean = y_values.mean()
ratio_std = y_values.std()

upper_bound = ratio_mean + ratio_std
lower_bound = ratio_mean - ratio_std

# plot graph
fig, ax = plt.subplots(figsize = (15,7))
ax.plot(plot_df['year_month'], y_values, marker = 'o', linestyle = '-', label = "Monthly Ratio", color = "green")

# add trend line (linear regression fit)
ax.plot(plot_df['year_month'], trend_y, linestyle='--', linewidth= 1.2, label = "Trend line", color = "red")

# add base line
ax.axhline(y=upper_bound, color='orange', linestyle='--', linewidth=1.2, label='Upper Bound')
ax.axhline(y=lower_bound, color='orange', linestyle='--', linewidth=1.2, label='Lower Bound')

ax.set_title('Non-Positive Ratio Over Time', fontsize=16)
ax.set_xlabel('Month')
ax.set_ylabel('Ratio (Non-Positive / All Reviews)')
ax.legend()
ax.grid(True, which='both', linestyle='--', linewidth=0.5)
ax.set_ylim(bottom=0)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

**Findings**  
The green line shows the ratio of non-positive reviews, the orange dashed lines represent the upper and lower bounds, and the red dashed line indicates the trend line.
Based on the linear regression analysis, the long-term trend line (red) for the non-positive review ratio remains almost flat with no meaningful slope. This suggests that over the period analyzed (2023 onward), the overall user experience has been consistent, with the ratio staying around 34%.
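
Whether the slope is "meaningful" can be checked numerically instead of by eye. The sketch below uses `scipy.stats.linregress`, which reports a p-value for the null hypothesis of zero slope. It is an illustration, not the fit used for the plot: `trend_summary` and the toy ratios are hypothetical, and the x-axis is a month index, so the slope is per month rather than per day as with `mdates.date2num`.

```python
import numpy as np
from scipy.stats import linregress

def trend_summary(month_index, ratios):
    """Fit a straight line and report the slope, its p-value, and R^2."""
    result = linregress(month_index, ratios)
    return {
        "slope_per_month": result.slope,
        "p_value": result.pvalue,        # H0: the slope is zero
        "r_squared": result.rvalue ** 2,
    }

# Hypothetical monthly ratios, for illustration only.
toy_ratios = [0.34, 0.36, 0.33, 0.35, 0.34, 0.36, 0.33, 0.35]
summary = trend_summary(np.arange(len(toy_ratios)), toy_ratios)
print(summary)
```

A large p-value together with an R^2 near zero would be consistent with describing the trend as flat.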

To improve user experience, we should pay attention to the short-term fluctuations that depart from this flat trend. In May–July 2023, September 2023, and June 2024, the non-positive review ratio rose sharply above the upper bound. This suggests that there may have been serious issues causing user inconvenience in those months. Analyzing the main complaints in the reviews from these periods could help identify the most urgent problems to address.

On the other hand, in February, August, and December 2023, as well as September–October 2024 and March 2025, the ratio of non-positive reviews quickly dropped below the lower bound. This suggests that there may have been successful updates, promotions, or other events that significantly boosted user satisfaction. Studying the positive reviews during these times can help identify the strongest points to maintain and further enhance.
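
The months outside the band can also be listed programmatically rather than read off the chart. The following is a minimal sketch assuming a pandas DataFrame with `year_month` and `non_positive_ratio` columns like `plot_df`. `flag_outlier_months` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_outlier_months(df, ratio_col="non_positive_ratio", date_col="year_month"):
    """Return the months whose ratio lies outside mean +/- one standard deviation."""
    mean, std = df[ratio_col].mean(), df[ratio_col].std()
    upper, lower = mean + std, mean - std
    # Keep only the months strictly above the upper or below the lower bound.
    outside = (df[ratio_col] > upper) | (df[ratio_col] < lower)
    flagged = df.loc[outside, [date_col, ratio_col]].copy()
    flagged["band"] = np.where(flagged[ratio_col] > upper, "above", "below")
    return flagged

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "year_month": pd.date_range("2023-01-01", periods=6, freq="MS"),
    "non_positive_ratio": [0.34, 0.35, 0.33, 0.50, 0.34, 0.20],
})
flagged_months = flag_outlier_months(toy)
print(flagged_months)
```

The resulting list of months could then be used to filter the reviews for the complaint and praise analysis suggested above.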