<h1 style='text-align:center'><b>Navigating Negative Feedback: Strategies for Addressing Low Ratings in American Starbucks</b></h1>

# <b>1 Introduction</b>
Starbucks in the US has received many reviews on ConsumerAffairs, and most of them are 1-star ratings. This worries the Customer Experience Manager because low ratings reflect poor customer satisfaction and damage how people perceive the brand. Understanding why customers give low ratings is the first step toward improving areas such as service or product quality.
The manager knows that customer reviews are about more than the numbers. They need to understand what customers are saying in their reviews, especially the 1-star ones, to see what is going wrong and respond quickly. To that end, this analysis uses the ratings and the review text to find common issues and trends among US Starbucks customers.

<b>Dataset</b>: The dataset was taken from [Kaggle](https://www.kaggle.com/datasets/harshalhonde/starbucks-reviews-dataset).
- `Name`: The reviewer's name, if available.
- `Location`: The location or city associated with the reviewer, if provided.
- `Date`: The date when the review was posted.
- `Rating`: The star rating given by the reviewer, ranges from 1 to 5.
- `Review`: The textual content of the review, captures the reviewer's experience and opinions.
- `Image` Links: Links to images associated with the reviews, if available.

# <b>2 Data Preparation</b>

In [59]:
import pandas as pd                                         # For data wrangling
from scipy.stats import skew, kurtosis, kstest, shapiro     # For data distribution
import altair as alt                                        # For data visualization
alt.data_transformers.disable_max_rows()                    # For lifting Altair's max-rows limit on charts
from nltk.tokenize import word_tokenize                     # For word tokenization
import re                                                   # For specific criteria exclusion
from nltk.corpus import stopwords                           # for stopword removal (English)
from collections import Counter                             # For getting frequency of items
from nltk.collocations import BigramCollocationFinder       # For bigram extraction
from nltk.collocations import BigramAssocMeasures           # For evaluating word association
import matplotlib.pyplot as plt                             # For data visualization
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [60]:
# define function to inspect dataframe
def inspect_dataframe(df):
    summary = {
        'ColumnName': df.columns.values.tolist(),
        'Nrow': df.shape[0],
        'DataType': df.dtypes.values.tolist(),
        'NApercent': (df.isna().mean() * 100).round(2).tolist(),
        'Nduplicate': df.duplicated().sum(),
        'UniqueValue': df.nunique().tolist(),
        'Sample': [df[col].unique().tolist() for col in df.columns]
    }
    return pd.DataFrame(summary)

In [61]:
# specify colnames to include
cols = ['name', 'location', 'Date', 'Rating', 'Review']

# import dataset
df = pd.read_csv('../data/reviews_data.csv', usecols=cols)

# convert column name into lowercase
df.columns = df.columns.str.lower()

# display result
print(f'The dataframe contains {df.shape[0]} rows and {df.shape[1]} cols.')
inspect_dataframe(df)

The dataframe contains 850 rows and 5 cols.


Unnamed: 0,ColumnName,Nrow,DataType,NApercent,Nduplicate,UniqueValue,Sample
0,name,850,object,0.0,1,604,"[Helen, Courtney, Daynelle, Taylor, Tenessa, A..."
1,location,850,object,0.0,1,633,"[Wichita Falls, TX, Apopka, FL, Cranberry Twp,..."
2,date,850,object,0.0,1,741,"[Reviewed Sept. 13, 2023, Reviewed July 16, 20..."
3,rating,850,float64,17.06,1,5,"[5.0, 1.0, 2.0, 3.0, 4.0, nan]"
4,review,850,object,0.0,1,814,[Amber and LaDonna at the Starbucks on Southwe...


<b>Comment</b>: 
- There are 850 rows and five columns, namely `name`, `location`, `date`, `rating`, and `review`.
- The column `date` should be in datetime format, but because each value starts with the word 'Reviewed', it is read as an object. This column therefore needs to be cleaned and converted to datetime.
- Column `rating` contains a considerable number of missing values. To determine which missing value treatment is suitable, it is necessary to understand the mechanism of the missing data first.

In [62]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [63]:
# check mechanism of missing values
df[df['rating'].isna()].loc[:, ['rating', 'review']].sample(10, random_state=42)

Unnamed: 0,rating,review
823,,No Review Text
723,,"On September 22, 2010 around 1:15pm, I went in..."
787,,I asked for half caf coffee at approximately 3...
802,,No Review Text
761,,This Starbucks repeatedly charges different pr...
716,,"On Friday, October 29, 2010, at around 9 pm I ..."
838,,No Review Text
770,,"While in the small Starbucks in Naples, the st..."
771,,"On 9/19/09, two gentlemen in front of me each ..."
722,,"Today, I checked my receipt after I got to wor..."


<b>Comment</b>: 
- The missing ratings appear to be random and unrelated to the other columns, particularly `review`: the affected rows include both full-length reviews and 'No Review Text' placeholders.
- Because the ratings are not normally distributed (they cluster at the lower end of the scale), we will replace the missing values with the median.
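
The reasoning behind choosing the median can be illustrated with a minimal sketch. The ratings below are made up for illustration (they are not taken from the dataset), and `sample_skewness` is a hypothetical helper that computes the moment-based skewness, the same biased estimator that `scipy.stats.skew` uses by default. When values pile up at the low end, the mean is pulled toward the long right tail while the median stays at the typical value.

```python
from statistics import mean, median

# Made-up ratings that cluster at the low end (illustrative only).
ratings = [1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5]

def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

mean_rating = mean(ratings)
median_rating = median(ratings)
skewness = sample_skewness(ratings)

# A positive skewness means a long tail toward high ratings,
# so the mean sits above the median.
print(f"mean={mean_rating:.2f}, median={median_rating}, skewness={skewness:.2f}")
```

In this toy sample the median equals the most common rating, which is why it is the safer fill value for a distribution shaped like this one.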

In [64]:
# check distribution: visual test
alt.Chart(df).mark_bar(size=50, color='#0b421a').encode(
    alt.X('rating:N', axis=alt.Axis(labelAngle=0)),
    alt.Y('count()', title='Count'),
).properties(
    title=alt.Title('The distribution of rating is positively skewed',
                    anchor='start',
                    fontSize=18,
                    offset=15),
    width=500, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12
)


In [65]:
# fill in missing values
df.fillna({'rating':df['rating'].median()}, inplace=True)

In [66]:
# double check missing values
df.isna().sum().reset_index(name='count')

Unnamed: 0,index,count
0,name,0
1,location,0
2,date,0
3,rating,0
4,review,0


<b>Comment</b>:
- The missing values have been handled by using median.

In [67]:
df['date'].head(10)

0    Reviewed Sept. 13, 2023
1     Reviewed July 16, 2023
2      Reviewed July 5, 2023
3      Reviewed May 26, 2023
4     Reviewed Jan. 22, 2023
5    Reviewed Sept. 14, 2023
6     Reviewed Sept. 8, 2023
7     Reviewed Aug. 25, 2023
8      Reviewed Aug. 5, 2023
9      Reviewed Aug. 4, 2023
Name: date, dtype: object

In [68]:
# remove reviewed in col date
df['date'] = df['date'].str.replace('Reviewed ', '')
# remove the period after abbreviated month names (regex=False treats '.' literally, not as a wildcard)
df['date'] = df['date'].str.replace('.', '', regex=False)
# convert Date to datetime
df['date'] = pd.to_datetime(df['date'])
# extract year from date
df['year'] = df['date'].dt.year
# check result
df.head()

Unnamed: 0,name,location,date,rating,review,year
0,Helen,"Wichita Falls, TX",2023-09-13,5.0,Amber and LaDonna at the Starbucks on Southwes...,2023
1,Courtney,"Apopka, FL",2023-07-16,5.0,** at the Starbucks by the fire station on 436...,2023
2,Daynelle,"Cranberry Twp, PA",2023-07-05,5.0,I just wanted to go out of my way to recognize...,2023
3,Taylor,"Seattle, WA",2023-05-26,5.0,Me and my friend were at Starbucks and my card...,2023
4,Tenessa,"Gresham, OR",2023-01-22,5.0,I’m on this kick of drinking 5 cups of warm wa...,2023


<b>Comment</b>
- The column `date` initially contained the prefix 'Reviewed ' and periods after abbreviated month names (e.g., 'Sept.'), so both were removed before converting the column from object to datetime.
- Once the conversion is complete, the year can be extracted from the column `date`. This new column will be useful for investigating the trend of review scores over the years.
- Additionally, since the dataset is about reviews, we will add a new column, `len_review`, to record the length of each review.
- Another column, `state`, will be extracted from the column `location` to better understand where the reviewers are located.
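
To make the date-cleaning step concrete, here is a minimal standard-library sketch of the same idea. The notebook itself relies on `pd.to_datetime`; the helper `parse_review_date` and the lookup table `MONTHS` are hypothetical names introduced only for this illustration, and the sketch assumes every value follows the 'Reviewed <month> <day>, <year>' pattern seen in the sample above.

```python
import re
from datetime import date

# Map the first three letters of a month name to its number.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def parse_review_date(raw):
    """Strip the 'Reviewed ' prefix and periods, then build a date object."""
    text = raw.replace("Reviewed ", "").replace(".", "")
    match = re.fullmatch(r"([A-Za-z]+) (\d{1,2}), (\d{4})", text)
    if match is None:
        raise ValueError(f"unexpected date format: {raw!r}")
    month_name, day, year = match.groups()
    # 'Sept' and 'July' both reduce to a three-letter key ('sep', 'jul').
    return date(int(year), MONTHS[month_name[:3].lower()], int(day))

parsed = parse_review_date("Reviewed Sept. 13, 2023")
print(parsed, parsed.year)
```

Raising an error on unexpected input, rather than guessing, makes it easier to spot rows that do not follow the assumed pattern.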

In [69]:
# function to extract state from location
def extract_state(location):
    if location[-2:].isupper():
        return location[-2:]
    else:
        return 'Other'

# apply the function to create a new column 'State'
df['state'] = df['location'].apply(extract_state)

# drop column location
df.drop(columns='location', inplace=True)

# check result
df.head(10)

Unnamed: 0,name,date,rating,review,year,state
0,Helen,2023-09-13,5.0,Amber and LaDonna at the Starbucks on Southwes...,2023,TX
1,Courtney,2023-07-16,5.0,** at the Starbucks by the fire station on 436...,2023,FL
2,Daynelle,2023-07-05,5.0,I just wanted to go out of my way to recognize...,2023,PA
3,Taylor,2023-05-26,5.0,Me and my friend were at Starbucks and my card...,2023,WA
4,Tenessa,2023-01-22,5.0,I’m on this kick of drinking 5 cups of warm wa...,2023,OR
5,Alyssa,2023-09-14,1.0,We had to correct them on our order 3 times. T...,2023,TX
6,ken,2023-09-08,1.0,I have tried Starbucks several different times...,2023,FL
7,Nikki,2023-08-25,1.0,Starbucks near me just launched new fall foods...,2023,NC
8,Alex,2023-08-05,1.0,"I ordered online for the Reisterstown Rd, St T...",2023,MD
9,Sunny,2023-08-04,1.0,Staff at the Smythe St. Superstore location in...,2023,Other


<b>Comment</b>
- We have added the columns `year` and `state` to the dataframe and dropped `location`.
- Before further analysis, we need to clean the text in the column `review`. Note that the cleaning below does not apply stemming, so that specific information such as personal pronouns and tenses is retained.

In [70]:
# create function to clean text
def clean_text(text):
    # keep only letters and whitespace (drops digits and punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # convert text to lowercase
    text = text.lower()
    return text

# apply function to clean the review
df['clean_review'] = df['review'].apply(clean_text)

# drop column review
df.drop(columns='review', inplace=True)

# check result
df[['clean_review']].sample(10, random_state=42)

Unnamed: 0,clean_review
512,we go to the hwy and dulles ave sugar land tx...
357,i like the way the set up is in the store and ...
110,i want to share my tragically experience on se...
684,im not sure if that is the correct date but it...
39,they wonder why they have to shut down locatio...
66,i went to starbucks today expecting to get wha...
756,my first day with the new gold card was so far...
260,this store has problems manhattan beach rosecr...
780,i ordered a caramel frappe with extra caramel ...
467,location state street new albany in barista ...


<b>Comment</b>: The review contents are now free of punctuation marks and numbers.
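
The effect of the cleaning regex can be seen on a single string. The sketch below applies the same pattern as `clean_text` to a shortened version of the opening of one review. The helper `collapse_spaces` is a hypothetical addition, not part of the notebook; it shows one way to tidy the double spaces left behind when a digit is removed.

```python
import re

def clean_text(text):
    # Same pattern as in the notebook: keep only letters and whitespace, then lowercase.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

def collapse_spaces(text):
    # Hypothetical extra step: squeeze runs of whitespace into single spaces.
    return ' '.join(text.split())

raw = "I’m on this kick of drinking 5 cups"
cleaned = clean_text(raw)

# The apostrophe and the digit are removed, so "I’m" becomes "im"
# and a double space remains where "5" used to be.
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

Because contractions lose their apostrophes ("I’m" becomes "im"), such tokens will not match a standard stop-word list later on, which is worth keeping in mind when reading the word-frequency results.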

In [71]:
# function to count the number of words (tokens) in each review
def calculate_wordfrequency(text):
    tokens = word_tokenize(text)
    word_freq = len(tokens)
    return word_freq

# apply the function to the review
df['len_review'] = df['clean_review'].apply(calculate_wordfrequency)

# check result
df.head(10)

Unnamed: 0,name,date,rating,year,state,clean_review,len_review
0,Helen,2023-09-13,5.0,2023,TX,amber and ladonna at the starbucks on southwes...,59
1,Courtney,2023-07-16,5.0,2023,FL,at the starbucks by the fire station on in a...,101
2,Daynelle,2023-07-05,5.0,2023,PA,i just wanted to go out of my way to recognize...,70
3,Taylor,2023-05-26,5.0,2023,WA,me and my friend were at starbucks and my card...,84
4,Tenessa,2023-01-22,5.0,2023,OR,im on this kick of drinking cups of warm wate...,73
5,Alyssa,2023-09-14,1.0,2023,TX,we had to correct them on our order times the...,61
6,ken,2023-09-08,1.0,2023,FL,i have tried starbucks several different times...,45
7,Nikki,2023-08-25,1.0,2023,NC,starbucks near me just launched new fall foods...,101
8,Alex,2023-08-05,1.0,2023,MD,i ordered online for the reisterstown rd st th...,44
9,Sunny,2023-08-04,1.0,2023,Other,staff at the smythe st superstore location in ...,86


In [72]:
# export data for external data analysis
df.to_csv('../data/cleaned_data_reviews.csv', index=False)

# <b>3 Data Analysis</b>
### <b>3.1 Review Rating Scores</b>

In [73]:
# count reviews per rating and compute each rating's share (%)
rating_df = (df['rating']
             .value_counts()
             .rename_axis('rating')
             .reset_index(name='count'))
rating_df['proportion'] = (rating_df['count'] / rating_df['count'].sum() * 100).round(2)
rating_df

Unnamed: 0,rating,count,proportion
0,1.0,595,70.08
1,2.0,99,11.66
2,5.0,83,9.78
3,4.0,39,4.59
4,3.0,33,3.89


In [74]:
# visualize the share of each rating (1-star highlighted)
alt.Chart(rating_df).mark_bar().encode(
    alt.X('proportion', title='Proportion'),
    alt.Y('rating:N', title='Rating', sort='-x'),
    tooltip=['rating', 'proportion'],
    color=alt.condition(
        alt.datum.rating == 1,
        alt.value('#0b421a'),  
        alt.value('lightgrey')) 
).properties(
    title=alt.Title('Majority of reviewers gave 1 star',
                    anchor='start',
                    fontSize=18,
                    offset=15),
    width=500, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
)

In [75]:
# get number of reviews over years
rating_over_years = (df.groupby('year')
                     .agg(func={'clean_review':'count'})
                     .reset_index())

# visualize the trend
alt.Chart(rating_over_years).mark_line(color='#0b421a').encode(
    alt.X('year:N', title=None, axis=alt.Axis(labelAngle=0, tickCount=12, values=[y for y in range(2000, 2024, 4)])),
    alt.Y('clean_review', title='Number of Reviews')
).properties(
    title=alt.Title('Growing number of reviews between 2000 and 2023',
                    anchor='start',
                    fontSize=18,
                    offset=15),
                    width=600,
                    height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
).configure_view(
    stroke=None
)

In [76]:
df[df['rating'] == 1].groupby('year').agg(func={'clean_review':'count'}).reset_index().rename(columns={'clean_review':'count'})

Unnamed: 0,year,count
0,2000,1
1,2004,2
2,2005,1
3,2006,2
4,2007,3
5,2008,31
6,2009,52
7,2010,47
8,2011,22
9,2012,25


In [77]:
# how is the trend of 1 star
one_rating_df = (df[df['rating'] == 1]
                 .groupby('year')
                 .agg(func={'clean_review':'count'})
                 .reset_index()
                 .rename(columns={'clean_review':'count'}))

line_chart = alt.Chart(one_rating_df).mark_line(color='#0b421a').encode(
    alt.X('year:N', title=None, axis=alt.Axis(labelAngle=0, tickCount=12, values=[y for y in range(2000, 2024, 4)])), 
    alt.Y('count', title='Number of Reviews'),
    alt.Tooltip(['year', 'count'])
).properties(
    title=alt.Title('The frequency of one-star ratings peaked in 2015',
                    anchor='start',
                    fontSize=18,
                    offset=15),
    width=650, height=300)

median_value = one_rating_df['count'].median()
median_line = (alt.Chart(pd.DataFrame({'median_value': [median_value]}))
               .mark_rule(color='red', strokeDash=[10,5])
               .encode(y='median_value:Q')
)

# combine the line chart with the median line
(line_chart + median_line).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
).configure_view(
    stroke=None
)

<b>Comment</b>
- The number of one-star reviews fluctuates over the years, with notable spikes such as the jump to 65 in 2015.
- There are also declines, such as the drop to 17 in 2013.

In [78]:
# check which state gave the 1 rating score
rating_state_df = (df[df['rating'] == 1]
                   .groupby('state')
                   .size()
                   .reset_index(name='count')
                   .sort_values(by='count', ascending=False)
                   .head(20))

alt.Chart(rating_state_df).mark_bar().encode(
    alt.X('state', sort='-y', axis=alt.Axis(labelAngle=0)), 
    alt.Y('count'),
    tooltip=['state', 'count'],
    color=alt.condition(
        alt.datum.state == 'CA',
        alt.value('#0b421a'),
        alt.value('lightgrey')) 
).properties(
    title=alt.Title('Most 1-star ratings originated from California',
    anchor='start',
    fontSize=18,
    offset=15),
    width=600, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
)

<b>Comment</b>
- California (CA) has the highest count of 1-star ratings, with 119 occurrences, followed by Florida (FL) with 32.
- The distribution suggests that dissatisfaction, as indicated by a rating of 1, varies across states.

In [79]:
# get names of customers in CA who gave a 1-star rating
(df[(df['state'] == 'CA') & (df['rating'] == 1)]
 .groupby('name')
 .size()
 .reset_index(name='count')
 .nlargest(10, 'count'))

Unnamed: 0,name,count
37,Jill,3
52,Linda,3
8,Becky,2
29,H,2
43,Katie,2
45,Kelly,2
46,Ken,2
85,Ryan,2
91,Steven,2
100,William,2


In [80]:
# get names of all customers in CA (any rating)
(df[df['state'] == 'CA']
 .groupby('name')
 .size()
 .reset_index(name='count')
 .nlargest(10, 'count'))

Unnamed: 0,name,count
80,Linda,5
58,Jill,3
67,Kathryn,3
14,Becky,2
34,David,2
44,H,2
62,Jose,2
68,Katie,2
71,Kelly,2
72,Ken,2


In [81]:
# check if linda is in other states
linda_distribution = (df[(df['name']=='Linda')]
 .groupby('state')
 .size()
 .reset_index(name='count'))
linda_distribution

Unnamed: 0,state,count
0,AZ,1
1,CA,5
2,CO,1
3,FL,1
4,HI,1
5,MN,1
6,SC,1
7,TN,1
8,VA,1


In [82]:
# prepare dataset
rating_name_df = (df[df['rating'] == 1]
                  .groupby('name')
                  .size()
                  .reset_index(name='count')
                  .nlargest(10, 'count'))

# visualize rating score by names
alt.Chart(rating_name_df).mark_bar().encode(
    alt.X('name', sort='-y', title='Customer Name', axis=alt.Axis(labelAngle=0)), 
    alt.Y('count', title='Count'),
    color=alt.condition(
        alt.datum.name == 'Linda',
        alt.value('#0b421a'),  
        alt.value('lightgrey')),
    tooltip=['name', 'count']
).properties(
    title=alt.Title('Linda is the reviewer name with the most low ratings',
    anchor='start',
    fontSize=18,
    offset=15),
    width=600, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
)

### <b>3.2 Content of Reviews</b>
#### <b>3.2.1 Review Length (in Words)</b>

In [83]:
# len review
one_review_df = (df[df['rating'] == 1]
                 .groupby('year')
                 .agg(func={'len_review':'median'})
                 .reset_index())

alt.Chart(one_review_df).mark_line(color='#0b421a').encode(
    alt.X('year:N', title=None, axis=alt.Axis(labelAngle=0, tickCount=12, values=[y for y in range(2000, 2024, 4)])), 
    alt.Y('len_review', title='Review Length (in words)')
).properties(
    title=alt.Title('Median of review length over years',
                    anchor='start',
                    fontSize=18,
                    offset=15),
    width=650, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
    grid=True
)

<b>Comment</b>
- The data shows fluctuations in the length of reviews for 1-star ratings over the years, with notable peaks observed in 2009, 2013, and 2016, indicating periods of increased detail or expression of dissatisfaction in customer feedback. 
- However, from 2009 to 2020, the trend in the length of reviews for 1-star ratings appears to be relatively stable, with no significant upward or downward trend observed over this period. 
- This stability suggests that customers expressed their dissatisfaction at a similar level of detail throughout these years.

#### <b>3.2.2 Most Frequent Words</b>

In [84]:
# load the English stop words once instead of on every call
stop_words = set(stopwords.words('english'))

# named count_words so it does not overwrite the earlier clean_text function
def count_words(text):
    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # remove numbers
    text = re.sub(r'\d+', '', text)
    # convert text to lowercase and tokenize words
    words = word_tokenize(text.lower())
    # keep only the words that are not stop words
    words = [word for word in words if word not in stop_words]
    # return the frequency of each remaining word
    return Counter(words)

# get low rating word frequency (summing the per-review Counters merges their counts)
low_rating = df[df['rating'] == 1]
low_rating_dict = low_rating['clean_review'].apply(count_words).sum()
low_rating_df = (pd.DataFrame.from_dict(low_rating_dict, orient='index', columns=['count'])
                 .reset_index(names='word')
                 .sort_values(by='count', ascending=False))

In [85]:
# visualize most frequent words
alt.Chart(low_rating_df.head(10)).mark_bar().encode(
    x=alt.X('count', title='Frequency'),
    y=alt.Y('word', sort='-x'),
    tooltip=['word', 'count'],
    color=alt.condition(
        (alt.datum.word == 'starbucks') | (alt.datum.word == 'coffee'),
        alt.value('#0b421a'), 
        alt.value('lightgrey'))
).properties(
    title=alt.Title('Starbucks and coffee dominate the low-rating reviews',
    anchor='start',
    fontSize=18,
    fontWeight='bold',
    offset=15),
    width=550, 
    height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)

<b>Comment</b>
- The words 'starbucks', 'coffee', 'drink', 'store', and 'order' are among the most frequent in reviews with a rating of 1. This suggests that dissatisfied customers often write about product quality ('coffee', 'drink'), service ('customer', 'store'), and the ordering process ('order'). The repetition of these words points to common themes of discontent.
- Words like 'customer', 'store', and 'order' suggest problems with service delivery and customer experience, such as long wait times, incorrect orders, or poor customer service.
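
The counting approach behind this chart can be sketched on a toy example. The three reviews and the tiny `STOP_WORDS` set below are made up for illustration (the notebook uses NLTK's English stop-word list and the real 1-star reviews), and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's).
STOP_WORDS = {"the", "a", "i", "my", "and", "was", "to", "for", "is", "at", "it"}

def top_words(reviews, k=3):
    """Count non-stop-word tokens across all reviews and return the k most common."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(k)

# Made-up reviews, not taken from the dataset.
reviews = [
    "The coffee was cold and the order was wrong",
    "I waited for my order at the store",
    "Wrong order again, and the coffee is overpriced",
]

print(top_words(reviews))
```

Updating one shared `Counter` in a loop avoids creating and adding a separate counter per review, which can matter once the number of reviews grows.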

In [86]:
# function to highlight text
def highlight_starbucks(text):
    return 'color: red' if 'starbucks' in text.lower() else ''

starbucks_sample = (low_rating[low_rating['clean_review']
                               .str.contains('starbucks')]
                               .loc[:, ['name', 'clean_review']]
                               .sample(5, random_state=42))

starbucks_sample.style.applymap(highlight_starbucks, subset=['clean_review'])

Unnamed: 0,name,clean_review
98,Ayako,corporate here in denver colorado honestly do not care about their loyal patrons on november i went to the hampden and locust location did a mobile order when i got there the store didnt open until am i requested for a refund and i was advised only the customer service can credit my account i sent snapshots of my order the starbucks rewards claimed that wasnt a valid receipt snapshots came from their mobile app if it wasnt a valid receipt why didnt they check it out since they are able to go and look at our previous orders they have looked at my previous orders many times where i did get my rewardsstars of course they could be lazy or dont give a care or both
253,Ellen,why do we continue to take the nonsense offered us by starbucks each establishment makes your drinks differently than the other what is the problem with the training i tried to get online and request points be credited to my account and the site kept responding a no to access so i changed the password and once again no to access of course after x locked out for hours really it is a coffee company reward card not a bank
386,Shanatta,this is located near where i stay decided to visit and go inside to purchase a drink normally just use drive through the employees there are very rude nasty the store reeked of cigarette smoke which was all in my clothes when i left the establishment as i stood there to place my order a worker behind the counter splashed water on myself and the cashier and all she could say was sorry i had a long day there is a reason i never dined in at this location i was so upset and took everything for me not to snatch her over that counter listen starbucks your drinks are not that great and overpriced so you would think you would work on your customer service better and at least keep your store clean the counters at this location were sticky nasty had trash on the counter man starbucks has really gone down i guess you start making money and everything else goes to hell
99,Karen,starbucks is taking advantage of their devoted customers and great employees prices roll out higher and higher every couple of months starbucks keeps taking customers money but will not pay their employees well no more starbucks for me price went up three times within a year and that was for an iced coffee with light cream only
700,Margaret,i ordered via starbucks coffee online i received and email that they were out of stock at the warehouse so they cancelled my order they left me a number i could call and place a replacement order


<b>Comment</b>
- The sampled reviews cover several complaint themes: refund and rewards issues (Ayako), inconsistent drink preparation and account access problems (Ellen), rude staff and hygiene concerns (Shanatta), price increases (Karen), and a cancelled online order (Margaret).

#### <b>3.2.3 Collocations of 'starbucks'</b>
Pointwise mutual information (PMI) quantifies the association between two terms that occur together (collocations):

$$\text{PMI}(w_1, w_2) = \log_2\left(\frac{P(w_1, w_2)}{P(w_1) \times P(w_2)}\right)$$
where:
- $w_1$ and $w_2$: word 1 and word 2
- $P(w_1, w_2)$: probability of co-occurrence of $w_1$ and $w_2$
- $P(w_1)$: probability of occurrence of $w_1$
- $P(w_2)$: probability of occurrence of $w_2$
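
As a worked example of the formula, the sketch below computes PMI by hand on a made-up token list. It assumes the simplest estimate of each probability, namely a raw count divided by the total number of tokens; the function name `pmi` and the toy tokens are illustrative only.

```python
import math
from collections import Counter

def pmi(w1, w2, tokens):
    """PMI of the adjacent pair (w1, w2), with probabilities estimated as count / N."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_joint = bigrams[(w1, w2)] / n
    if p_joint == 0:
        # The pair never occurs together, so the logarithm is undefined.
        return float("-inf")
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_joint / (p_w1 * p_w2))

# Made-up tokens: 'gift' and 'card' each occur twice and always together.
tokens = "gift card lost gift card refund never came".split()

# P(gift, card) = 2/8, P(gift) = 2/8, P(card) = 2/8
# PMI = log2((2/8) / ((2/8) * (2/8))) = log2(4)
print(pmi("gift", "card", tokens))
```

A positive score means the two words appear together more often than their individual frequencies would predict; a pair that never occurs in that order is reported as negative infinity in this sketch.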

In [87]:
# filter low rating (to include), location (to include), and containing unwanted text (to exclude)
low_rating = df[(df['rating'] == 1) & (df['state'] == 'CA') & ~(df['clean_review'].str.contains('no review text'))]

# combine filtered result
low_rating_combined = ' '.join(low_rating['clean_review'])

In [88]:
# define stop words (English only)
stopwords_ = set(stopwords.words('english'))

# filter words
words = [word.lower() for word in low_rating_combined.split() if word not in stopwords_]

# get collocations from words
finder = BigramCollocationFinder.from_words(words)

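# score bigrams with NLTK's `mi_like` measure, a PMI-like variant (`bgm.pmi` would give the PMI defined above);
# the scores are stored in the column named 'pmi' below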
bgm = BigramAssocMeasures()
collocations = {bigram: pmi for bigram, pmi in finder.score_ngrams(bgm.mi_like)}

# put the results into a df
collocation_df_new = pd.DataFrame(list(collocations.items()), columns=['collocation', 'pmi'])
# collapse collocation pairs using '_'
collocation_df_new["collocation"] = collocation_df_new["collocation"].apply(lambda x: '_'.join(x))
# filter only those containing 'starbucks'
filtered_collocation_df = (collocation_df_new[collocation_df_new['collocation']
                                              .apply(lambda x: 'starbucks' in x)]
                                              .nlargest(10, 'pmi'))

# show result
filtered_collocation_df

Unnamed: 0,collocation,pmi
2261,starbucks_stores,0.065104
2357,starbucks_gift,0.060096
2474,go_starbucks,0.054968
2582,starbuckscom_two,0.05
2690,went_starbucks,0.045
3119,card_starbuckscom,0.029412
3316,starbucks_reward,0.025
3650,contact_starbucks,0.016667
3692,starbucks_vallejo,0.016667
3699,uncomfortable_starbucks,0.016667


#### <b>3.2.4 Four-grams of Most Frequent Words</b>

In [89]:
# filter out no review text from the dataframe
filtered_reviews = df[df['clean_review'] != 'no review text']

# concatenate all reviews into a single string
combined_text = ' '.join(filtered_reviews['clean_review'])

# create function for generating ngrams
def generate_ngrams(text, span):
    # tokenize text and convert to lowercase
    words = word_tokenize(text.lower())
    # keep only tokens made up entirely of word characters
    words = [word for word in words if re.match(r'^\w+$', word)]
    # slide a window of `span` words over the text
    ngrams = [' '.join(words[i:i+span]) for i in range(len(words) - span + 1)]
    # count each n-gram and sort by frequency (descending)
    freq_list = Counter(ngrams).most_common()
    return pd.DataFrame(freq_list, columns=['ngram', 'frequency'])

# generate ngrams
ngrams_df = generate_ngrams(combined_text, span=4)

# filter ngrams and get top-10
starbucks_ngrams = (ngrams_df[ngrams_df['ngram']
                              .str.contains('starbucks')]
                              .nlargest(10, 'frequency'))

# display results
starbucks_ngrams

Unnamed: 0,ngram,frequency
1,i went to starbucks,16
8,went to the starbucks,10
9,i go to starbucks,9
17,at the starbucks in,8
21,at the starbucks on,6
23,i called starbucks and,6
28,to the starbucks on,6
55,have been a starbucks,5
82,that starbucks does not,4
92,starbucks i have always,4


In [90]:
# check rows contain the specified 4-grams
df[(df['rating'] == 1) & (df['clean_review'].str.contains('i went to starbucks'))]

Unnamed: 0,name,date,rating,year,state,clean_review,len_review
41,Trenton,2022-11-29,1.0,2022,GA,so i went to starbucks at the kroger on lower ...,79
56,Aryelle,2022-09-29,1.0,2022,VA,i went to starbucks and asked for a vanilla an...,46
194,Mary,2019-04-24,1.0,2019,NY,i purchased a starbucks tumbler and used my st...,119
255,Jill,2018-03-21,1.0,2018,CA,on friday march th around pm i went to starbu...,141
400,Judith,2016-11-16,1.0,2016,IL,i went to starbucks online site and my passwor...,81
402,Grace,2016-10-24,1.0,2016,Other,i am a coffee lover i usually go to mcdonalds ...,119
478,raman,2015-08-13,1.0,2015,UT,i am very good customer for the starbucks i we...,41
546,Shanna,2014-12-10,1.0,2014,PA,i went to starbucks website november to order...,83
568,Andriy,2014-10-02,1.0,2014,NY,i went to starbucks in indiana state henry sch...,191
615,Emily,2013-09-23,1.0,2013,NY,yesterday i went to starbucks for what i thoug...,168


In [93]:
# function to highlight text
def highlight_text(text):
    return 'color: red' if 'i went to starbucks' in text else ''

# define df for searching
subset_starbucks = (df[(df['rating'] == 1) & (df['clean_review']
                                              .str.contains('i went to starbucks'))])

# sample rows containing the 4-grams
(subset_starbucks[['name', 'clean_review']]
 .sample(12, random_state=42)
 .style.applymap(highlight_text, subset=['clean_review']))

Unnamed: 0,name,clean_review
615,Emily,yesterday i went to starbucks for what i thought was going to be a relaxing cup of coffee i planned on meeting my friend there and having a relaxing time well it didnt go as planned because i parked in the handicapped spot no i didnt have a handicapped sticker and i was standing on line when a store associate in front of everyone stood in the middle of the store and loudly said is that your car in the handicapped spot and i responded yes he says can you move it i said no and i began explaining why when another store associate female began getting loud and telling me i had to move it i said i was not moving my car because they pick and choose when they want to enforce this rule and when its convenient for them i continued to say if youre going to enforce a rule it needs to be enforced all the time not just when its convenient to enforce
679,Alma,well i went to starbucks today at am on the guy took my order and he didnt give me my receipt and he threw it away without asking me if i wanted it are you serious and i ordered two grande mocha he didnt not fill it up it was halfway filled are you serious i dont remember the guys name
41,Trenton,so i went to starbucks at the kroger on lower fayetteville and there was this old man serving coffee there so i went up to there and asked for a tall americano and he asked my name and i said it but he said he didnt hear me so i said it louder so when i got my drink i was just sipping it until i looked at the receipt on my cup it said trenae worst service ever
687,Rania,today friday november i went to starbucks located at east market st leesburg va i ordered grande latte and i had to wait so long because the person who was doing the drinks did drinks for people who came after me and placed their order after me i told him that i placed my order before these people and what he did wasnt right he said to me in a very rude way ill get your drink and youll get the out of here he added starbucks doesnt want to have customers like you i was totally shocked by what he said and i asked that i want to speak to the store manager i explained to the store manager dennis what happened and i asked him to give me the corporate phone number the store manager didnt do anything about what happened in his store
402,Grace,i am a coffee lover i usually go to mcdonalds to have a cup of latte but today for a change i went to starbucks and ordered a hot cappuccino i sat there in starbucks and used the free wifi after minutes something i take my cappuccino with me as i decide to go back to my home when i am about to reach my home the cappuccino leaks from the lid my dress soaked with cappuccino its a bad experience for me since i always buy coffee from mcdonalds and it never happened to me before even though the coffee from starbucks are really expensive they are using very cheap cups i will never go to starbucks again
568,Andriy,i went to starbucks in indiana state henry schricker travel plaza store code in at am i ordered ice caramel macchiato cheese danish and juice naked for a total i paid by cash i gave to cashier her name her id bill only bill that i had she gave my change the change suppose to be i took money i started to count my change in front of her and told her that she gave me not enough that she suppose to give me more her answer kill me i gave you all change you lost it i told her that i was not going anywhere i was staying in front of her waiting for the order she said that if i cant handle my money it is my problem i was in shock then i asked manager the manager came right away she even did not asked what was the problem she took my change took all money from the cashier went on the back somewhere and few minutes later they came back gave me but took and told me that their cashier has no extra money everything is fine
194,Mary,i purchased a starbucks tumbler and used my starbucks card i put it away for summer drinks as oz of hot coffee would buzz me to bits last week i went to starbucks and had them put an iced coffee in there this week i made my own and i was drinking it and noticed that the cap didnt stay closed i drank and the next thing i knew i almost choked on a piece of plastic which fell into the tumbler it scared the life out of me choking hazard who knew i called starbucks and they said the only way to return it was with a receipt which i didnt keep since i used my starbucks card
56,Aryelle,i went to starbucks and asked for a vanilla and and the barista gave me what was left in the blender i asked for a new one and the barista charged me double for it do not go to jefferson davis highway bermuda square chester va
774,Sophie,this morning around h spain hour i went to starbucks to get a muffin that cost ob la rambla barcelona i gave the girl behind the counter a bill she said to me i dont have change so i asked the other cashier and she didnt have any either instead of offering me an apology she said it happens her answer was really rude and unprofessional it was only and theyre already out of change it would be just normal if they carried extra change because of their touristic location in spain starbucks are touristically sighted what if the friend i was with had had no change starbucks wouldve lost money because of that store and employee i would never have eaten my muffin what a shame
400,Judith,i went to starbucks online site and my password wasnt recognized i followed their directions and changed my password the site recognized the change until i used it had to change the password times finally i phoned the company their response your account has been closed for hrs i bought a tumbler last week collect only starbucks cups buy coffee to brew at home and stop in or days a week to buy coffee thank you starbucks youre a real winner


In [94]:
alt.Chart(starbucks_ngrams).mark_bar(color='#0b421a').encode(
    x=alt.X('frequency', title='Frequency'),
    y=alt.Y('ngram', sort='-x', title='Chunks'),
    tooltip=['ngram', 'frequency']
).properties(
    title=alt.Title(
        'Word Frequency in Low Rating Comments',
        anchor='start',
        fontSize=18,
        fontWeight='bold'
    ),
    width=400, height=300
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)

<b>Comment</b>
- The frequent occurrence of n-grams such as 'i went to starbucks', 'a cup of coffee', and 'i asked for a' suggests that customers often describe specific incidents or interactions during their visits to Starbucks. These may include receiving incorrect orders ('i asked for a'), encountering long wait times ('in front of me'), or dissatisfaction with product quality ('a cup of coffee').
- The repetition of these n-grams implies that many customers have had similar negative experiences and describe them in similar phrases. The n-grams 'i have been a loyal' and 'i would like to' may also indicate customers expressing their dissatisfaction or suggesting improvements.
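
The n-gram counting that underlies a chart like the one above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `top_ngrams`, the sample reviews, and the plain whitespace tokenization are all assumptions made for the example (the notebook imports `word_tokenize` from NLTK, which may split text differently).

```python
from collections import Counter


def top_ngrams(texts, n=4, k=5):
    """Return the k most common word n-grams across already-cleaned review texts."""
    counts = Counter()
    for text in texts:
        # Assumption: reviews are lowercased and punctuation-free, so a
        # whitespace split is enough to get word tokens.
        tokens = text.split()
        # Slide a window of n tokens over the review and count each phrase.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Hypothetical sample reviews, written for illustration (not taken from the dataset).
sample_reviews = [
    "i went to starbucks and i asked for a latte",
    "today i went to starbucks for a cup of coffee",
    "i asked for a refund after i went to starbucks",
]

for ngram, frequency in top_ngrams(sample_reviews, n=4, k=2):
    print(ngram, frequency)
```

Phrases that recur across reviews rise to the top of the count, which is why shared openers such as 'i went to starbucks' dominate the ranking even though they carry little information about the complaint itself.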

# <b>4 Conclusions</b>

<b>Summary</b>: In brief, this analysis investigates the underlying factors contributing to low ratings for Starbucks in the US. It shows that branches in California have the highest count of 1-star reviews. When customers write low-rating reviews, they tend to use longer words to express their complaints. They often provide detailed descriptions of their experiences, frequently mentioning words like 'went' and 'Starbucks'. Furthermore, the analysis reveals that the main concerns revolve around product quality, service, and the ordering process.

<b>Recommendation</b>: Based on the findings, we recommend the following actions for the Customer Experience Manager at Starbucks: prioritize specific regions, especially California; focus on improving product quality; enhance customer service standards for stores in the state to address service-related complaints; and streamline the ordering process to reduce customer frustration.