<h2>What are the most frequently occurring words in negative reviews?</h2>

In [0]:
%pip install nltk

In [0]:
import nltk

In [0]:
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, lit, when, udf, explode, lower
from pyspark.sql.types import IntegerType, ArrayType, StringType

In [0]:
spark = SparkSession.builder.appName("analysis_painpoint").getOrCreate()

In [0]:
id_to_exclude = ["https://www.reddit.com/r/korea/comments/1it9gty/exclusive_being_taken_prisoner_is_treason_in/", "https://www.reddit.com/r/korea/comments/11a53p7/colonial_police_warned_residents_about_police/"]

In [0]:
appstore_df = spark.table("workspace.growth_poc.silver_appstore_reviews") \
                   .filter(year(col("updated")) >= lit(2023)) \
                   .select(
                       col("updated").alias("review_date"),
                       col("rating").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("country"),
                       col("language")
                   ) 
playstore_df = spark.table("workspace.growth_poc.silver_playstore_reviews") \
                    .filter(year(col("at")) >= lit(2023)) \
                    .select(
                       col("at").alias("review_date"),
                       col("score").alias("review_rate"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("thumbsUpCount").alias("review_thumbsUpCount"),
                       col("appName"),
                       col("language")
                   ) 
reddit_df = spark.table("workspace.growth_poc.silver_reddit_reviews")\
                    .filter((year(col("created_datetime")) >= lit(2023)) & ~(col("url").isin(id_to_exclude))) \
                    .select(
                       col("created_datetime").alias("review_date"),
                       col("content_translated").alias("review_content"),
                       col("sentences").alias("review_sentences"),
                       col("words").alias("review_words"),
                       col("score").alias("review_thumbsUpCount"),
                       col("language")
                   ) 
                    
review_contents_df = appstore_df.unionByName(playstore_df,allowMissingColumns=True) \
                                .unionByName(reddit_df,allowMissingColumns=True)
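
The three sources do not share a schema: the Play Store frame has no `country`, and the Reddit frame has neither `review_rate` nor `country`. With `allowMissingColumns=True`, `unionByName` fills the absent columns with nulls, which is why `review_rate` can be null later on. The plain-Python sketch below only illustrates that behavior on lists of dicts; `union_by_name` and the sample rows are hypothetical and not part of the notebook.

```python
def union_by_name(*tables):
    """Union lists of dict rows by column name, filling absent columns with None."""
    # Collect every column name, preserving first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild every row with the full column set; missing values become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


# Hypothetical rows standing in for the App Store and Reddit frames.
appstore_rows = [{"review_rate": 1, "country": "us", "language": "en"}]
reddit_rows = [{"language": "en"}]  # Reddit has no rating or country

combined = union_by_name(appstore_rows, reddit_rows)
```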


<h2>1. Simple Keyword Frequency Analysis</h2>

In [0]:
def detect_negative_review(sentence):
    if not sentence: # None or empty string
        return 0 
    score = analyzer.polarity_scores(sentence)["compound"]
    result = 1 if score < 0 else 0
    return result

In [0]:
# Use VADER for sentiment analysis
analyzer = SentimentIntensityAnalyzer()

# Mark negative reviews:
# - if review_rate is available, a rating of 2 or lower counts as negative
# - if review_rate is missing (e.g. Reddit posts), fall back to VADER's compound score
detect_negative_review_udf = udf(detect_negative_review, IntegerType())
mark_negative_reviews_df = review_contents_df.withColumn(
    "is_negative",
    when(col("review_rate").isNull(), detect_negative_review_udf(col("review_content")))
    .when(col("review_rate") <= 2, 1)
    .otherwise(0),
)

# get only negative reviews
negative_df = mark_negative_reviews_df.filter(col("is_negative") == 1)

# flatten words
words_exploded = negative_df.select(explode(col("review_words")).alias("word"))

# set lowercase and count 
word_counts = words_exploded.withColumn("word", lower(col("word"))) \
                            .groupBy("word").count() \
                            .orderBy(col("count").desc()).limit(100)

word_counts.show(100, truncate= False)

| word       | count |
|------------|-------|
| delivery   | 162   |
| order      | 153   |
| food       | 149   |
| korea      | 144   |
| korean     | 138   |
| get        | 132   |
| time       | 131   |
| app        | 130   |
| card       | 101   |
| one        | 100   |
| even       | 92    |
| like       | 90    |
| use        | 82    |
| number     | 81    |
| thing      | 78    |
| phone      | 72    |
| go         | 72    |
| restaurant | 68    |
| people     | 63    |
| work       | 62    |
| also       | 62    |
| back       | 61    |
| foreigner  | 60    |
| service    | 60    |
| make       | 60    |
| need       | 58    |
| know       | 57    |
| place      | 57    |
| country    | 55    |
| pay        | 53    |
| problem    | 53    |
| day        | 53    |
| review     | 52    |
| item       | 51    |
| driver     | 49    |
| way        | 49    |
| want       | 49    |
| really     | 48    |
| never      | 47    |
| take       | 45    |
| account    | 45    |
| every      | 44    |
| still      | 43    |
| got        | 42    |
| something  | 41    |
| think      | 41    |
| year       | 39    |
| money      | 38    |
| come       | 38    |
| going      | 38    |
| used       | 37    |
| bad        | 37    |
| door       | 37    |
| lot        | 36    |
| much       | 36    |
| live       | 35    |
| without    | 34    |
| try        | 34    |
| good       | 34    |
| coupang    | 33    |
| id         | 33    |
| see        | 33    |
| call       | 32    |
| english    | 32    |
| many       | 32    |
| since      | 32    |
| u          | 32    |
| apps       | 32    |
| option     | 31    |
| ordered    | 31    |
| issue      | 31    |
| said       | 30    |
| ever       | 30    |
| customer   | 30    |
| first      | 30    |
| may        | 30    |
| say        | 30    |
| bank       | 30    |
| using      | 28    |
| hour       | 28    |
| worst      | 28    |
| payment    | 28    |
| always     | 28    |
| could      | 28    |
| tourist    | 27    |
| foreign    | 27    |
| sure       | 27    |
| feel       | 27    |
| delivered  | 26    |
| someone    | 26    |
| new        | 26    |
| able       | 25    |
| help       | 25    |
| tried      | 25    |
| thats      | 25    |
| minute     | 24    |
| leave      | 24    |
| well       | 24    |
| might      | 24    |
| tip        | 24    |
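
The labeling rule and the counting step above can be sketched without Spark. This is a minimal illustration, assuming the sentiment score has already been computed; `is_negative` and the sample rows are hypothetical, and the scores are made-up stand-ins for VADER's compound value.

```python
from collections import Counter


def is_negative(review_rate, compound_score):
    """Rating decides when present; otherwise fall back to the sentiment score."""
    if review_rate is None:
        return compound_score is not None and compound_score < 0
    return review_rate <= 2


# Hypothetical rows: (review_rate, compound_score, review_words)
reviews = [
    (1, None, ["Delivery", "late", "delivery"]),
    (5, None, ["great", "app"]),
    (None, -0.6, ["card", "rejected"]),
    (None, 0.4, ["fast", "delivery"]),
]

# Keep only negative reviews, flatten their words, lowercase, and count.
word_counts = Counter(
    word.lower()
    for rate, score, words in reviews
    if is_negative(rate, score)
    for word in words
)
top_words = word_counts.most_common(2)
```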


In [0]:
print(negative_df.count())

I extracted the 100 most frequently occurring words in 542 negative reviews and grouped them into three key problem areas, plus one item that needs a closer look.<br/>
<ol>
<li><b>Foreigner-Specific Issues</b></li>
korean(138), english(32), foreigner(60), foreign(27), call(32), tourist(27)<br/>
These words suggest that foreigners are likely to face language-related or systemic difficulties when using the apps.

<li><b>Delivery Service Quality</b></li>
time(131), service(60), restaurant(68), door(37), item(51), driver(49), option(31), customer(30)<br/>
These words point to problems with the delivery experience itself, such as delivery time, restaurant service, item or order issues, and communication with the driver.

<li><b>Payment & Verification</b></li>
card(101), use(82), number(81), phone(72), pay(53), account(45), money(38), id(33), bank(30), payment(28)<br/>
These words point to difficulties completing an order because of payment or verification problems. My assumption is that they relate to issues such as foreign cards or phone verification.

<li><b>Special Attention</b></li>
The brand name "coupang" (33) stands out because it is the only company name in the list. N-gram analysis can determine whether it refers to the Coupang company itself or to the Coupang Eats app.


</ol>
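
To compare the three problem areas on a rough common scale, the per-word counts can be summed per category. The sketch below copies the counts from the word table above; the category names are my own labels, and the totals are word mentions, not review counts, since one review can contain several of these words.

```python
# Counts copied from the top-100 word table above.
word_counts = {
    "korean": 138, "english": 32, "foreigner": 60, "foreign": 27,
    "call": 32, "tourist": 27,
    "time": 131, "service": 60, "restaurant": 68, "door": 37,
    "item": 51, "driver": 49, "option": 31, "customer": 30,
    "card": 101, "use": 82, "number": 81, "phone": 72, "pay": 53,
    "account": 45, "money": 38, "id": 33, "bank": 30, "payment": 28,
}

# Hypothetical category labels mirroring the three groups in the list above.
categories = {
    "foreigner_specific": ["korean", "english", "foreigner", "foreign", "call", "tourist"],
    "delivery_quality": ["time", "service", "restaurant", "door", "item", "driver",
                         "option", "customer"],
    "payment_verification": ["card", "use", "number", "phone", "pay", "account",
                             "money", "id", "bank", "payment"],
}

# Sum the word mentions that fall into each category.
category_totals = {
    name: sum(word_counts[word] for word in words)
    for name, words in categories.items()
}
```

On these counts, payment and verification words are mentioned most often, followed by delivery quality and then foreigner-specific words.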



<h2>2. N-gram Analysis</h2>

In [0]:
# take a list of words and return a list of strings, each joining two adjacent words
def create_bigrams_from_list(words):
    if not words or len(words) < 2:
        return []
    bigrams_list = list(zip(words[:], words[1:])) # zip stops when the shorter list is exhausted
    # words[:]   = ['I',    'EAT',    'BANANA']
    # words[1:]  = ['EAT',  'BANANA']
    # => [('I', 'EAT'), ('EAT', 'BANANA')]

    # if you want to check trigrams..
    #bigrams_list = list(zip(words[:], words[1:], words[2:])) 

    bigrams = [" ".join(grams) for grams in bigrams_list]
    return bigrams
 
# register the function as a UDF
create_bigrams_udf = udf(create_bigrams_from_list, ArrayType(StringType()))
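
Instead of commenting lines in and out to switch between bigrams and trigrams, the window size could be a parameter. The following is a minimal sketch of that idea in plain Python; `create_ngrams` and its `n` argument are hypothetical names, not something the notebook defines.

```python
def create_ngrams(words, n=2):
    """Return space-joined n-grams of adjacent words; empty list if too short."""
    if not words or n < 1 or len(words) < n:
        return []
    # zip over n staggered views of the list; it stops at the shortest view.
    windows = zip(*(words[i:] for i in range(n)))
    return [" ".join(window) for window in windows]


sample = ["need", "korean", "phone", "number"]
bigrams = create_ngrams(sample, n=2)
trigrams = create_ngrams(sample, n=3)
```

To use it from Spark, one option would be to register one UDF per window size, for example by wrapping the call in a `lambda` that fixes `n`.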
    

In [0]:
# build the list of bigrams for each negative review
bigrams_df = negative_df.withColumn("keywords_paired", 
                                    create_bigrams_udf(col("review_words"))) \
                        .select("keywords_paired")

In [0]:
# flatten keywords and aggregate (count)
bigrams_flat_df = bigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

bigrams_flat_df.show(50,truncate = False)

**Bigram Analysis Results**
| keywords         | count |
|------------------|-------|
| phone number     | 43    |
| food delivery    | 25    |
| customer service | 19    |
| credit card      | 18    |
| korean phone     | 18    |
| delivery driver  | 13    |
| delivery service | 11    |
| order delivery   | 11    |
| food delivered   | 11    |
| coupang eats     | 11    |
| use app          | 10    |
| gon na           | 10    |
| delivery apps    | 9     |
| cancel order     | 9     |
| foreign card     | 9     |
| first time       | 9     |
| bank account     | 9     |
| apple pay        | 8     |
| delivery app     | 8     |
| feel like        | 8     |
| korean bank      | 8     |
| even though      | 7     |
| every time       | 7     |
| bank card        | 7     |
| app ever         | 7     |
| every country    | 7     |
| money back       | 6     |
| worst app        | 6     |
| uber eats        | 6     |
| order food       | 6     |
| go back          | 6     |
| front door       | 6     |
| delivery guy     | 6     |
| delivery time    | 6     |
| need korean      | 6     |
| place live       | 6     |
| pc bang          | 6     |
| online shopping  | 5     |
| korean food      | 5     |
| able use         | 5     |
| waste time       | 5     |
| delivery food    | 5     |
| app even         | 5     |
| thing like       | 5     |
| payment card     | 5     |
| make sure        | 5     |
| leave review     | 5     |
| order something  | 5     |
| one time         | 5     |
| thing korea      | 5     |



In [0]:
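# Trigram run: this cell reuses the bigram code as-is. The result below assumes
# create_bigrams_from_list was switched to the commented-out trigram zip and the
# cells above were re-run, so keywords_paired holds three-word sequences.
bigrams_flat_df = bigrams_df.select(explode(col("keywords_paired")).alias("keywords"))\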
                            .groupBy("keywords")\
                            .count()\
                            .orderBy(col("count").desc())

bigrams_flat_df.show(20, truncate = False)

**Trigram Analysis Result**
| keywords                     | count |
|-------------------------------|-------|
| korean phone number           | 17    |
| worst app ever                | 5     |
| need phone number             | 5     |
| food delivery service         | 4     |
| foreign credit card           | 4     |
| need korean phone             | 4     |
| get money back                | 3     |
| korean bank account           | 3     |
| alien registration card       | 3     |
| arc alien registration        | 3     |
| english eye english           | 2     |
| phone number set              | 2     |
| use non korean                | 2     |
| without korea phone           | 2     |
| foreign card work             | 2     |
| support apple pay             | 2     |
| credit card accepted          | 2     |
| contact customer service      | 2     |
| eye english eye               | 2     |
| food discarded even           | 2     |



By applying N-gram analysis, I was able to better capture the context of word usage, which single keyword analysis alone could not fully reveal.

<ol>
<li><b>Dominant Issue: Foreigner Verification & Payment</b></li>
Key Words:
<ul>
<li>Bigram: phone number (43), credit card (18), korean phone (18), bank account (9), korean bank (8), apple pay (8), payment card (5)</li>
<li>Trigram: korean phone number (17), need phone number (5), foreign credit card (4), need korean phone (4), korean bank account (3), alien registration card (3), arc alien registration (3), phone number set (2), without korea phone (2), foreign card work (2), support apple pay (2)</li>
</ul>
The counts indicate that the biggest difficulty for users is the verification step that requires a Korean phone number. Payment failures, where a foreign credit card is not accepted or a Korean bank account is required, also emerge as a clear issue.

<li><b>Service Quality Issues</b></li>
Key Words:
<ul>
<li>Bigram: customer service (19), delivery driver (13), delivery service (11)</li>
<li>Trigram: worst app ever (5), food delivery service (4), get money back (3), contact customer service (2)</li>
</ul>
Beyond verification and payment, users also express dissatisfaction with service quality, including customer service, delivery drivers, and delivery times.

<li><b>Foreigner-Specific Issues</b></li>
Key Words:
<ul>
<li>Bigram: need korean (6)</li>
</ul>
Some foreign users report inconvenience related to language barriers. However, while I initially assumed the word "korean" was used in the context of missing language support, N-gram analysis shows it more often appears in the context of needing a Korean phone number or bank account.

<li><b>Mentions of Specific Apps</b></li>
Key Words:
<ul>
<li>Bigram: coupang eats (11), uber eats (6)</li>
</ul>
Bigram analysis shows that the word “coupang,” which appeared in the single keyword analysis, often refers to the food delivery app Coupang Eats (11 of its 33 mentions). Uber Eats, which is widely used internationally, also appears. The mentions of specific apps suggest high market visibility, and users may be treating these apps as benchmarks for their expectations.
</ol>
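
The claim about how "korean" is used can be checked directly against the bigram counts. The sketch below uses only the bigrams containing "korean" that appear in the top-50 output above, so the share it computes is a rough indication, not a figure for the full dataset; the variable names are my own.

```python
from collections import Counter

# Bigrams containing "korean", copied from the top-50 bigram table above.
bigram_counts = {
    "korean phone": 18,
    "korean bank": 8,
    "need korean": 6,
    "korean food": 5,
}

# Count which word follows "korean".
followers = Counter()
for bigram, count in bigram_counts.items():
    first, second = bigram.split(" ")
    if first == "korean":
        followers[second] += count

# Share of "korean <word>" bigrams that relate to verification or payment.
verification_words = {"phone", "bank"}
verification_share = (
    sum(count for word, count in followers.items() if word in verification_words)
    / sum(followers.values())
)
```

Within this subset, roughly five out of six "korean <word>" bigrams are about a phone number or a bank account.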
