<h2>What are the most frequently occurring words in negative reviews?</h2>

In [0]:
%pip install nltk

In [0]:
import nltk

In [0]:
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, lit, when, udf, explode, lower
from pyspark.sql.types import IntegerType, ArrayType, StringType

In [0]:
spark = SparkSession.builder.appName("analysis_painpoint").getOrCreate()

In [0]:
id_to_exclude = "https://www.reddit.com/r/korea/comments/1it9gty/exclusive_being_taken_prisoner_is_treason_in/"

In [0]:
appstore_df = spark.table("workspace.growth_poc.silver_appstore_reviews") \
                   .filter(year(col("updated")) >= lit(2023)) \
                   .select(
                       col("updated").alias("review_date"),
                       col("rating").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("country"),
                       col("language")
                   ) 
playstore_df = spark.table("workspace.growth_poc.silver_playstore_reviews") \
                    .filter(year(col("at")) >= lit(2023)) \
                    .select(
                       col("at").alias("review_date"),
                       col("score").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("language")
                   ) 
reddit_df = spark.table("workspace.growth_poc.silver_reddit_reviews")\
                    .filter((year(col("created_datetime")) >= lit(2023)) & (col("url") != id_to_exclude)) \
                    .select(
                       col("created_datetime").alias("review_date"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("score").alias("review_thumbsUpCount"),
                       col("language")
                   ) 
                    
review_contents_df = appstore_df.unionByName(playstore_df,allowMissingColumns=True) \
                                .unionByName(reddit_df,allowMissingColumns=True)


<h2>1. Simple Keyword Frequency Analysis</h2>

In [0]:
def detect_negative_review(sentence):
    if not sentence: # None or empty string
        return 0 
    score = analyzer.polarity_scores(sentence)["compound"]
    result = 1 if score < 0 else 0
    return result
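
The labeling rule applied in the next cell (star rating first, sentiment score only as a fallback when no rating exists) can be checked locally without Spark. The sketch below is a minimal plain-Python version of that rule. `label_negative` and `toy_score` are hypothetical names introduced for illustration, and `toy_score` is only a crude stand-in for VADER's compound score, not VADER itself.

```python
from typing import Callable, Optional


def label_negative(
    rating: Optional[int],
    text: Optional[str],
    score_fn: Callable[[str], float],
) -> int:
    """Return 1 for a negative review, 0 otherwise."""
    # A rating, when present, decides the label: 3 stars or fewer is negative.
    if rating is not None:
        return 1 if rating <= 3 else 0
    # No rating and no text: nothing to judge, treat as not negative.
    if not text:
        return 0
    # No rating: fall back to the sentiment score (negative when below zero).
    return 1 if score_fn(text) < 0 else 0


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: positive hits minus negative hits."""
    negative_words = {"worst", "bad", "never"}
    positive_words = {"good", "great"}
    tokens = text.lower().split()
    positives = sum(token in positive_words for token in tokens)
    negatives = sum(token in negative_words for token in tokens)
    return positives - negatives


print(label_negative(2, "great food", toy_score))
print(label_negative(None, "worst app ever", toy_score))
```

In the notebook itself, the scorer would be the VADER-based `detect_negative_review` function rather than the toy scorer used here.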

In [0]:
# Use VADER for sentiment analysis
analyzer = SentimentIntensityAnalyzer()

# Mark negative reviews
# If review_rate is 3 or lower, mark the review as negative
# If review_rate is missing (e.g. Reddit posts, which have no rating), fall back to VADER
detect_negative_review_udf = udf(detect_negative_review, IntegerType())
mark_negative_reviews_df = review_contents_df.withColumn("is_negative", \
    when(col("review_rate").isNull(), detect_negative_review_udf(col("review_content")))\
    .when(col("review_rate") <= 3, 1)\
    .otherwise(0)   
)

# get only negative reviews
negative_df = mark_negative_reviews_df.filter(col("is_negative") == 1)

# flatten words
words_exploded = negative_df.select(explode(col("review_words")).alias("word"))

# set lowercase and count 
word_counts = words_exploded.withColumn("word", lower(col("word"))) \
                            .groupBy("word").count() \
                            .orderBy(col("count").desc()).limit(100)

word_counts.show(100, truncate=False)

| word       | count |
|------------|-------|
| delivery   | 163   |
| order      | 156   |
| food       | 156   |
| korea      | 149   |
| korean     | 144   |
| get        | 134   |
| app        | 133   |
| time       | 132   |
| card       | 108   |
| one        | 103   |
| like       | 92    |
| even       | 92    |
| number     | 86    |
| use        | 85    |
| thing      | 83    |
| phone      | 73    |
| go         | 72    |
| restaurant | 70    |
| also       | 69    |
| people     | 66    |
| work       | 65    |
| make       | 65    |
| back       | 61    |
| service    | 61    |
| foreigner  | 60    |
| need       | 59    |
| place      | 59    |
| know       | 57    |
| year       | 56    |
| day        | 56    |
| country    | 55    |
| pay        | 54    |
| review     | 54    |
| problem    | 53    |
| take       | 51    |
| item       | 51    |
| way        | 50    |
| want       | 50    |
| good       | 50    |
| never      | 49    |
| driver     | 49    |
| every      | 48    |
| really     | 48    |
| account    | 46    |
| got        | 44    |
| still      | 43    |
| money      | 42    |
| think      | 42    |
| something  | 41    |
| police     | 41    |
| without    | 38    |
| come       | 38    |
| see        | 38    |
| going      | 38    |
| live       | 37    |
| used       | 37    |
| bad        | 37    |
| door       | 37    |
| lot        | 36    |
| much       | 36    |
| coupang    | 35    |
| u          | 35    |
| try        | 34    |
| since      | 34    |
| call       | 33    |
| many       | 33    |
| id         | 33    |
| english    | 32    |
| may        | 32    |
| say        | 32    |
| bank       | 32    |
| apps       | 32    |
| issue      | 32    |
| using      | 31    |
| first      | 31    |
| option     | 31    |
| ordered    | 31    |
| ever       | 31    |
| always     | 31    |
| customer   | 30    |
| said       | 30    |
| crime      | 30    |
| end        | 29    |
| payment    | 29    |
| foreign    | 29    |
| new        | 29    |
| could      | 29    |
| hour       | 28    |
| worst      | 28    |
| someone    | 28    |
| sure       | 28    |
| feel       | 28    |
| delivered  | 27    |
| tourist    | 27    |
| help       | 26    |
| seoul      | 26    |
| able       | 25    |
| two        | 25    |
| person     | 25    |
| tried      | 25    |


In [0]:
print(negative_df.count())
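
The flatten, lowercase, and count steps above can be reproduced on a small in-memory sample, which is handy for sanity-checking the Spark output. This is a minimal sketch that assumes each review is already a list of word tokens, as in `review_words`. `top_words` and `sample_reviews` are hypothetical names, and the sample data is made up for illustration.

```python
from collections import Counter
from itertools import chain


def top_words(reviews, n=100):
    """Flatten tokenized reviews, lowercase each word, and return the n most common."""
    # Skip reviews with no tokens, mirroring how explode() drops null or empty arrays.
    flattened = chain.from_iterable(review or [] for review in reviews)
    return Counter(word.lower() for word in flattened).most_common(n)


# Made-up sample; each inner list stands for one review's word tokens.
sample_reviews = [
    ["Delivery", "late", "Order"],
    ["order", "cancelled"],
    None,
    ["delivery", "Driver", "order"],
]

for word, count in top_words(sample_reviews, n=3):
    print(word, count)
```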

I extracted the 100 most frequently occurring words from the 554 negative reviews and grouped them into three key problems, plus one observation that deserves special attention.<br/>
<ol>
<li><b>Foreigner-Specific Issues</b></li>
korean(144), english(32), foreigner(60), foreign(29), call(33), tourist(27)<br/>
This suggests that foreign users are likely to face language barriers or systemic obstacles when using the apps.

<li><b>Delivery Service Quality</b></li>
time(132), service(61), restaurant(70), door(37), item(51), driver(49)<br/>
This points to problems in the delivery experience itself, such as delivery time, restaurant service, delivery problems, and communication with the driver.

<li><b>Payment & Verification</b></li>
card(108), use(85), number(86), phone(73), pay(54), money(42), account(46), id(33), bank(32), payment(29)<br/>
This suggests that users struggle to complete orders because of payment or verification issues. I assume the problems are related to foreign cards, phone verification, and similar requirements.

<li><b>Special Attention</b></li>
It is interesting that the name of a specific app, <b>coupang</b>, is mentioned frequently (35 times) in the reviews. This may indicate that users see Coupang as an alternative to move to when their current apps fail to satisfy them.


</ol>
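
To compare the themes by overall weight, the per-word counts can be summed per category. The sketch below is one possible way to do that. The counts are copied from the frequency table above, while `theme_totals` and the `themes` mapping are hypothetical names. The word-to-theme assignment follows the manual grouping in this section and is a judgment call, not something derived automatically.

```python
# Counts copied from the top-100 word frequency table.
word_counts = {
    "korean": 144, "english": 32, "foreigner": 60, "foreign": 29,
    "call": 33, "tourist": 27,
    "time": 132, "service": 61, "restaurant": 70, "door": 37,
    "item": 51, "driver": 49,
    "card": 108, "use": 85, "number": 86, "phone": 73, "pay": 54,
    "money": 42, "account": 46, "id": 33, "bank": 32, "payment": 29,
}

# Manual grouping of words into the three themes described above.
themes = {
    "foreigner_specific": ["korean", "english", "foreigner", "foreign", "call", "tourist"],
    "delivery_quality": ["time", "service", "restaurant", "door", "item", "driver"],
    "payment_verification": ["card", "use", "number", "phone", "pay", "money",
                             "account", "id", "bank", "payment"],
}


def theme_totals(counts, theme_words):
    """Sum the word counts for each theme; words missing from counts add 0."""
    return {
        theme: sum(counts.get(word, 0) for word in words)
        for theme, words in theme_words.items()
    }


for theme, total in theme_totals(word_counts, themes).items():
    print(theme, total)
```

Because generic words such as "use" or "time" can appear in any kind of complaint, these totals are only a rough indicator of how much each theme weighs.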



<h2>2. N-gram Analysis</h2>

In [0]:
# Take a list of words and return a list of space-joined bigrams (adjacent word pairs)
def create_bigrams_from_list(words):
    if not words or len(words) < 2:
        return []
    # zip stops at the end of the shortest input, so the last word starts no pair
    bigrams_list = list(zip(words, words[1:]))
    # words      = ['I',   'EAT',    'BANANA']
    # words[1:]  = ['EAT', 'BANANA']
    # => [('I', 'EAT'), ('EAT', 'BANANA')]

    # To produce trigrams instead, swap in the line below
    # bigrams_list = list(zip(words, words[1:], words[2:]))

    bigrams = [" ".join(grams) for grams in bigrams_list]
    return bigrams

# Register the function as a UDF
create_bigrams_udf = udf(create_bigrams_from_list, ArrayType(StringType()))
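
Instead of editing the `zip` line to switch between bigrams and trigrams, the window size can be passed as a parameter. The following is a minimal plain-Python sketch of that idea. `create_ngrams` is a hypothetical helper, not part of the notebook, and it would still need to be wrapped in a UDF (for example through a lambda that fixes `n`) before being applied to `review_words`.

```python
def create_ngrams(words, n=2):
    """Return all runs of n consecutive words, each joined by a single space."""
    # Nothing to build for missing input, a non-positive n, or too few words.
    if not words or n < 1 or len(words) < n:
        return []
    # Slide a window of size n over the list, one position at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = ["need", "korean", "phone", "number"]
print(create_ngrams(tokens, n=2))
print(create_ngrams(tokens, n=3))
```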
    

In [0]:
# Build the n-gram column for each negative review
bigrams_df = negative_df.withColumn("keywords_paired", 
                                    create_bigrams_udf(col("review_words"))) \
                        .select("keywords_paired")

In [0]:
# flatten keywords and aggregate (count)
bigrams_flat_df = bigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

bigrams_flat_df.show(50)

**Bigram Analysis Results**
| keywords                 | count |
|--------------------------|-------|
| phone number             | 43    |
| food delivery            | 25    |
| customer service         | 19    |
| credit card              | 19    |
| korean phone             | 18    |
| delivery driver          | 13    |
| food delivered           | 12    |
| delivery service         | 11    |
| order delivery           | 11    |
| coupang eats             | 11    |
| police officer           | 11    |
| use app                  | 10    |
| bank account             | 10    |
| police chief             | 10    |
| gon na                   | 10    |
| first time               | 9     |
| korean bank              | 9     |
| cancel order             | 9     |
| foreign card             | 9     |
| delivery apps            | 9     |
| delivery app             | 8     |
| apple pay                | 8     |
| bank card                | 8     |
| feel like                | 8     |
| unscrupulous criminal... | 8     |
| end year                 | 8     |
| app ever                 | 7     |
| even though              | 7     |
| every time               | 7     |
| patriotic group          | 7     |
| every country            | 7     |
| delivery guy             | 6     |
| go back                  | 6     |
| front door               | 6     |
| money back               | 6     |
| speak korean             | 6     |
| delivery time            | 6     |
| need korean              | 6     |
| negative review          | 6     |
| police department        | 6     |
| order food               | 6     |
| worst app                | 6     |
| thing like               | 6     |
| make sure                | 6     |
| uber eats                | 6     |
| place live               | 6     |
| pc bang                  | 6     |
| able use                 | 5     |
| payment card             | 5     |
| new year                 | 5     |


In [0]:
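# Trigram run: re-run the cells above with the three-way zip enabled in
# create_bigrams_from_list, so that keywords_paired holds three-word phrases
bigrams_flat_df = bigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\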
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

bigrams_flat_df.show(20)

**Trigram Analysis Result**
| keywords                   | count |
|----------------------------|-------|
| korean phone number        | 17    |
| worst app ever             | 5     |
| need phone number          | 5     |
| foreign credit card        | 5     |
| food delivery service      | 4     |
| korean bank account        | 4     |
| need korean phone          | 4     |
| police chief mr            | 4     |
| police chief isozaki       | 4     |
| get money back             | 3     |
| alien registration         | 3     |
| arc alien registration     | 3     |
| crime end year             | 3     |
| interview police           | 3     |
| english eye english        | 2     |
| phone number set           | 2     |
| use non korean             | 2     |
| without korea phone        | 2     |
| foreign card work          | 2     |
| support apple pay          | 2     |


By applying N-gram analysis, I was able to pinpoint the users' complaints in more detail.

<ol>
<li><b>Dominant Issue: Foreigner Verification & Payment</b></li>
The most frequent n-grams overwhelmingly point to a single problem. Keywords like 'korean phone number (17)', 'phone number (43)', 'foreign credit card (5)', and 'korean bank account (4)' show that the biggest hurdles for foreign users are the mandatory Korean phone number verification and payment failures with foreign-issued cards.

<li><b>Core Service Quality</b></li>
Keywords like 'customer service (19)', 'delivery driver (13)', and 'worst app ever (5)' point to fundamental problems with the service quality itself.

<li><b>Coupang Eats</b></li>
The frequent mention of 'coupang eats (11)' is an important finding. It highlights the app's high market presence and its role as a benchmark for user expectations. When describing their experiences with food delivery services in general, users often reference the most well-known app as a point of comparison. Therefore, these mentions provide insight into the current market standards and what features or service levels users consider normal.
</ol>
