# Case Study - Customer Retention

<center><img src="https://1000logos.net/wp-content/uploads/2023/04/Starbucks-logo.png" height=216, width=384></img></center>

#### You are a dedicated data analyst at Starbucks in Kuala Lumpur, Malaysia, and have noticed a recent fluctuation in customer loyalty among in-store patrons. Your goal is not only to enhance customer satisfaction as a factor of customer retention but also to pave the way for increased revenue.

## Business Understanding

### Defining the Problem Statements

Using the SMART framework:

1. **Specific**: Enhancing loyalty among existing customers, especially in store.

2. **Measurable**: Achieve a 90% customer retention rate.

3. **Achievable**: Improve the quality of products, service, and in-store ambience, along with the WIFI quality. Also, implement promos.

4. **Relevant**: Increasing customer retention can lead to higher revenue.

5. **Time-Bound**: Achieve within the next quarter.

`Problem statement`:


The goal is to enhance customer loyalty among existing in-store customers, aiming for a 90% customer retention rate within the next quarter. This will be achieved by improving product and service quality, enhancing in-store ambience, optimizing WIFI quality, and implementing strategic promotions, ultimately driving higher revenue.


### Breaking Down the Problem

Main problem: `improving customer retention as a metric of customer loyalty`

To ease our analysis and solve the problem, we need to understand the problem in detail. To do that, we can use any framework for finding the root of a problem, such as `5W+1H`, the `Fishbone Diagram`, etc. Here, we will use `5W+1H`.

The `5W+1H`s:
- What factors can lead to customer retention improvement?
- Who are the people loyal to Starbucks?
- How will the quality of products and services be enhanced to improve customer retention?
- Why is improving customer retention important for the company?
- etc.

We will focus only on the top 3 questions to guide our analysis.

## Data Understanding

### Basic Data Information

**Dataset Description**

> This dataset is composed of survey responses from over 100 respondents about their buying behavior at Starbucks.
Income is shown in Malaysian Ringgit (RM).

**Context**

> Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs

**Content**

> Demographic info about customers – gender, age range, employment status, income range
Their current behavior in buying Starbucks
Facilities and features of Starbucks that contribute to the behavior

**Columns**
```
Timestamp
1. Your Gender
2. Your Age
3. Are you currently....?
4. What is your annual income?
5. How often do you visit Starbucks?
6. How do you usually enjoy Starbucks?
7. How much time do you normally  spend during your visit?
8. The nearest Starbucks's outlet to you is...?
9. Do you have Starbucks membership card?
10. What do you most frequently purchase at Starbucks?
11. On average, how much would you spend at Starbucks per visit?
12. How would you rate the quality of Starbucks compared to other brands (Coffee Bean, Old Town White Coffee..) to be:
13. How would you rate the price range at Starbucks?
14. How important are sales and promotions in your purchase decision?
15. How would you rate the ambiance at Starbucks? (lighting, music, etc...)
16. You rate the WiFi quality at Starbucks as..
17. How would you rate the service at Starbucks? (Promptness, friendliness, etc..)
18. How likely you will choose Starbucks for doing business meetings or hangout with friends?
19. How do you come to hear of promotions at Starbucks? Check all that apply.
20. Will you continue buying at Starbucks?
```

In [1]:
import pandas as pd
from scipy import stats
import requests

Our data comes from an API that you can access at this URL: https://p0w3casestudy.vercel.app/

As we learned on day 2, to fetch data from an API we can use `requests` with its `get` method. In addition, we need to check the data type of the result: is it JSON, a string, or something else?

In this case the parsed result is a dictionary, so we can convert it directly into a pandas DataFrame.
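A slightly more defensive variant of this fetch could look like the sketch below. The helper names `fetch_json` and `to_frame` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook. The sample payload is made up so that the conversion step can be tried offline.

```python
import pandas as pd
import requests


def fetch_json(url, timeout=10):
    """Fetch a URL and return the decoded JSON body (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def to_frame(payload):
    """Convert a decoded JSON payload into a DataFrame (hypothetical helper)."""
    if not isinstance(payload, (dict, list)):
        raise TypeError(f"Unexpected payload type: {type(payload).__name__}")
    return pd.DataFrame(payload)


# Offline illustration with a tiny made-up payload shaped as
# column -> {row label -> value}, one layout that pd.DataFrame accepts.
sample = {
    "Gender": {"0": "Female", "1": "Male"},
    "Buying Again": {"0": "Yes", "1": "No"},
}
frame = to_frame(sample)
print(frame.shape)
```

With a live connection, `to_frame(fetch_json(url))` would replace the two separate steps used in the cells that follow.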

In [2]:
url = "https://p0w3casestudy.vercel.app/"
result = requests.get(url)
print(type(result.json()))

<class 'dict'>


In [3]:
data = pd.DataFrame(result.json())
data.head()

Unnamed: 0,Timestamp,Gender,Age,Occupation,Annual Income,Visit,Behaviour,Time Spend,Distance,Membership,...,Money Spend,Quality Rate,Price Rate,Importance of Promo,Ambience Rate,WIFI Rate,Service Rate,Preference Rate,Promotion Channel,Buying Again
0,2019/10/01 12:38:43 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Dine in,Between 30 minutes to 1 hour,within 1km,Yes,...,Less than RM20,4,3,5,5,4,4,3,Starbucks Website/Apps;Social Media;Emails;Dea...,Yes
1,2019/10/01 12:38:54 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Take away,Below 30 minutes,1km - 3km,Yes,...,Less than RM20,4,3,4,4,4,5,2,Social Media;In Store displays,Yes
2,2019/10/01 12:38:56 PM GMT+8,Male,From 20 to 29,Employed,"Less than RM25,000",Monthly,Dine in,Between 30 minutes to 1 hour,more than 3km,Yes,...,Less than RM20,4,3,4,4,4,4,3,In Store displays;Billboards,Yes
3,2019/10/01 12:39:08 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Take away,Below 30 minutes,more than 3km,No,...,Less than RM20,2,1,4,3,3,3,3,Through friends and word of mouth,No
4,2019/10/01 12:39:20 PM GMT+8,Male,From 20 to 29,Student,"Less than RM25,000",Monthly,Take away,Between 30 minutes to 1 hour,1km - 3km,No,...,Around RM20 - RM40,3,3,4,2,2,3,3,Starbucks Website/Apps;Social Media,Yes


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, 0 to 121
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Timestamp            122 non-null    object
 1   Gender               122 non-null    object
 2   Age                  122 non-null    object
 3   Occupation           122 non-null    object
 4   Annual Income        122 non-null    object
 5   Visit                122 non-null    object
 6   Behaviour            121 non-null    object
 7   Time Spend           122 non-null    object
 8   Distance             122 non-null    object
 9   Membership           122 non-null    object
 10  Item Purchase        122 non-null    object
 11  Money Spend          122 non-null    object
 12  Quality Rate         122 non-null    int64 
 13  Price Rate           122 non-null    int64 
 14  Importance of Promo  122 non-null    int64 
 15  Ambience Rate        122 non-null    int64 
 16  WIFI Rate    

Our data consists of 21 columns and 122 rows. That is pretty small for data analysis, but it's okay for learning and an initial approach. We still have inferential statistics, lol.

Unfortunately, we have missing values in our data, in the `Behaviour` and `Promotion Channel` columns. We will drop the affected row(s) later on.
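Instead of reading the non-null counts off `info()`, the missing values can also be counted directly. The following is a minimal sketch on made-up rows; only the two column names are taken from the dataset, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the survey.
toy = pd.DataFrame({
    "Behaviour": ["Dine in", None, "Take away"],
    "Promotion Channel": ["Social Media", "Emails", "Billboards"],
})

# Number of missing values per column, keeping only columns that have any
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Rows that dropna() would remove (at least one missing value)
rows_with_missing = toy[toy.isna().any(axis=1)]
print(len(rows_with_missing))
```

Running the same two expressions on `data` shows which columns and how many rows are affected before anything is dropped.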

## Data Preparation

Fortunately, our data is almost ready for analysis, so we only need to handle the missing values. As discussed above, we will simply drop the rows that contain them.

However, what about the `Timestamp` column? Do we need to convert its data type to `datetime64[ns]`? Hmm, it's not important to do that, since `Timestamp` only records when the response was stored in the database. Remember that our data is not time series data; each row represents one customer. Alternatively, we can drop `Timestamp` from our data.
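For completeness, if the conversion were ever needed, it could be done roughly as sketched below. The two strings are copied from the first rows of the dataset. Stripping the ` GMT+8` suffix and keeping the result timezone-naive are simplifying assumptions, and the sketch assumes every row carries the same suffix.

```python
import pandas as pd

# Two timestamps copied from the first rows of the dataset
raw = pd.Series(["2019/10/01 12:38:43 PM GMT+8", "2019/10/01 12:38:54 PM GMT+8"])

# Drop the fixed " GMT+8" suffix, then parse the 12-hour clock format.
# The result is timezone-naive local time (an assumption made for simplicity).
parsed = pd.to_datetime(
    raw.str.replace(" GMT+8", "", regex=False),
    format="%Y/%m/%d %I:%M:%S %p",
)
print(parsed.dtype)
```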

### Missing Values Handling

In [5]:
data = data.dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 121
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Timestamp            121 non-null    object
 1   Gender               121 non-null    object
 2   Age                  121 non-null    object
 3   Occupation           121 non-null    object
 4   Annual Income        121 non-null    object
 5   Visit                121 non-null    object
 6   Behaviour            121 non-null    object
 7   Time Spend           121 non-null    object
 8   Distance             121 non-null    object
 9   Membership           121 non-null    object
 10  Item Purchase        121 non-null    object
 11  Money Spend          121 non-null    object
 12  Quality Rate         121 non-null    int64 
 13  Price Rate           121 non-null    int64 
 14  Importance of Promo  121 non-null    int64 
 15  Ambience Rate        121 non-null    int64 
 16  WIFI Rate    

The data is now free of missing values, and we only lost 1 row.

In [6]:
df = data.drop(columns='Timestamp')
df.head()

Unnamed: 0,Gender,Age,Occupation,Annual Income,Visit,Behaviour,Time Spend,Distance,Membership,Item Purchase,Money Spend,Quality Rate,Price Rate,Importance of Promo,Ambience Rate,WIFI Rate,Service Rate,Preference Rate,Promotion Channel,Buying Again
0,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Dine in,Between 30 minutes to 1 hour,within 1km,Yes,Coffee,Less than RM20,4,3,5,5,4,4,3,Starbucks Website/Apps;Social Media;Emails;Dea...,Yes
1,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Take away,Below 30 minutes,1km - 3km,Yes,Cold drinks;Pastries,Less than RM20,4,3,4,4,4,5,2,Social Media;In Store displays,Yes
2,Male,From 20 to 29,Employed,"Less than RM25,000",Monthly,Dine in,Between 30 minutes to 1 hour,more than 3km,Yes,Coffee,Less than RM20,4,3,4,4,4,4,3,In Store displays;Billboards,Yes
3,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Take away,Below 30 minutes,more than 3km,No,Coffee,Less than RM20,2,1,4,3,3,3,3,Through friends and word of mouth,No
4,Male,From 20 to 29,Student,"Less than RM25,000",Monthly,Take away,Between 30 minutes to 1 hour,1km - 3km,No,Coffee;Sandwiches,Around RM20 - RM40,3,3,4,2,2,3,3,Starbucks Website/Apps;Social Media,Yes


We created a new DataFrame without `Timestamp`.

## Modeling/Exploratory Data Analysis

#### What factors can lead to customer retention improvement?

To answer this question, we will perform a chi-squared test of independence, since we want to test the relationship between two categorical variables. But what about the rating data? They are "numerical", aren't they? For simplicity, we will treat the ratings as categorical data as well.

To perform the test, we need to define the hypotheses:

$H_0$: A variable and `buying again` are independent

$H_1$: A variable and `buying again` are dependent

We will use a confidence level of 95%, i.e. a significance level of 0.05.

In [7]:
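# Test every candidate factor except the multi-select column and the target itself
cols = data.drop(columns=['Timestamp','Promotion Channel','Buying Again']).columns

for col in cols:
  # Contingency table of the factor against the target
  cross = pd.crosstab(data[col],data['Buying Again'])
  pval = stats.chi2_contingency(cross).pvalue
  # Flag factors whose p-value is below the 0.05 significance level
  prompt = "and they are dependent" if pval<0.05 else ""
  print(f"P-value of {col} and Buying Again: {pval} {prompt}")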

P-value of Gender and Buying Again: 1.0 
P-value of Age and Buying Again: 0.6356871673042823 
P-value of Occupation and Buying Again: 0.1607051559293076 
P-value of Annual Income and Buying Again: 0.7438495801427483 
P-value of Visit and Buying Again: 0.0081671269342487 and they are dependent
P-value of Behaviour and Buying Again: 0.1968218965282764 
P-value of Time Spend and Buying Again: 0.07096869017305363 
P-value of Distance and Buying Again: 0.5201380402399829 
P-value of Membership and Buying Again: 0.0005711760401142383 and they are dependent
P-value of Item Purchase and Buying Again: 0.5439062916107664 
P-value of Money Spend and Buying Again: 0.00011367023704496573 and they are dependent
P-value of Quality Rate and Buying Again: 6.93854822279136e-05 and they are dependent
P-value of Price Rate and Buying Again: 1.45956480340212e-05 and they are dependent
P-value of Importance of Promo and Buying Again: 0.853345487868004 
P-value of Ambience Rate and Buying Again: 0.0061673913

Based on this, the factors that relate to customer retention are visit frequency, membership, money spent, product quality, price, ambience, service quality, and purpose of visit/preference.

However, to find out in detail which of them leads customers to be loyal, we need to explore each variable.

Let's encode the `Buying Again` column to make the retention rate calculation easier.

In [8]:
df = data.copy()
# Encode the target as 1/0 so that its mean equals the retention rate
df['Buying Again'] = df['Buying Again'].map({'Yes':1,'No':0})
df['Buying Again'].head()


Unnamed: 0,Buying Again
0,1
1,1
2,1
3,0
4,1


To explore how each variable relates to retention, we first need to calculate the retention rate by taking the average of `Buying Again`. Because the column is encoded as 1 and 0, its mean is the retention rate!
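The reason the mean works is that summing a 1/0 column counts the "Yes" answers, and dividing by the number of rows turns that count into a share. A minimal sketch, using the overall figure reported in this notebook (94 of 121 respondents say they will buy again); the variable names are illustrative:

```python
import pandas as pd

# 94 "Yes" and 27 "No" answers, matching the 94 of 121 respondents
# who say they will keep buying (row order does not affect the mean).
answers = pd.Series(["Yes"] * 94 + ["No"] * 27)
encoded = answers.map({"Yes": 1, "No": 0})

# Mean of a 1/0 column = number of ones / number of rows
retention_rate = encoded.mean()
print(round(retention_rate, 4))
```

This overall rate of roughly 78% is the baseline against which the per-group rates can be compared.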

In [9]:
df.groupby('Visit')[['Buying Again']].mean()

Unnamed: 0_level_0,Buying Again
Visit,Unnamed: 1_level_1
Daily,1.0
Monthly,0.961538
Never,0.5
Rarely,0.710526
Weekly,1.0


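Retention is highest among regular visitors: daily and weekly visitors all say they will buy again, and monthly visitors are at about 96%. It drops to about 71% for those who rarely visit and to 50% for those who never visit, which is consistent with the chi-squared result above.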

In [10]:
df.groupby('Membership')[['Buying Again']].mean()

Unnamed: 0_level_0,Buying Again
Membership,Unnamed: 1_level_1
No,0.639344
Yes,0.916667


Membership also matters: customers with a membership card have a clearly higher retention rate (about 92%) than those without one (about 64%).
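To see what the chi-squared test does for a single variable, the membership case can be worked through on its own. The counts below are reconstructed from the retention rates above (55 of 60 members and 39 of 61 non-members say they will buy again), so treat them as derived figures rather than a new result.

```python
import pandas as pd
from scipy import stats

# Rows: membership, columns: Buying Again. Counts reconstructed from the
# retention rates (55/60 = 0.9167 for members, 39/61 = 0.6393 for non-members).
table = pd.DataFrame(
    {"No": [22, 5], "Yes": [39, 55]},
    index=pd.Index(["No", "Yes"], name="Membership"),
)

# For a 2x2 table SciPy applies Yates' continuity correction by default.
chi2, pval, dof, expected = stats.chi2_contingency(table)

print("Expected counts under independence:")
print(pd.DataFrame(expected, index=table.index, columns=table.columns).round(2))
print("p-value:", pval)
print("Reject H0 at the 0.05 level:", pval < 0.05)
```

The test compares the observed counts with the counts expected if membership and `Buying Again` were independent; a large gap between the two gives a small p-value.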

In [11]:
df.groupby('Money Spend')[['Buying Again']].mean()

Unnamed: 0_level_0,Buying Again
Money Spend,Unnamed: 1_level_1
Around RM20 - RM40,0.933333
Less than RM20,0.706897
More than RM40,1.0
Zero,0.363636


People who spend more at Starbucks tend to be loyal and buy again: retention rises from about 36% for those who spend nothing to 100% for those who spend more than RM40 per visit.

Rating data are actually ordinal, so to measure the relationship in more detail we can use the Kendall rank correlation.

In [12]:
df.groupby('Quality Rate')[['Buying Again']].mean().reset_index()

Unnamed: 0,Quality Rate,Buying Again
0,1,0.0
1,2,0.25
2,3,0.74359
3,4,0.857143
4,5,0.913043


In [13]:
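def corr(col):
  # Retention rate for each level of the rating column
  tmp = df.groupby(col)[['Buying Again']].mean().reset_index()
  # Kendall's tau between the rating level and its retention rate
  # (computed on the grouped means, so only a handful of points)
  tau, pval = stats.kendalltau(tmp[col],tmp['Buying Again'])
  print('Kendall Tau:',tau)
  print('P-value:',pval)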

In [14]:
corr('Quality Rate')

Kendall Tau: 0.9999999999999999
P-value: 0.016666666666666666


In [15]:
corr('Price Rate')

Kendall Tau: 0.9999999999999999
P-value: 0.016666666666666666


In [16]:
corr('Ambience Rate')

Kendall Tau: 0.9486832980505137
P-value: 0.022977401503206065


In [17]:
corr('Service Rate')

Kendall Tau: 0.7999999999999999
P-value: 0.08333333333333333


In [18]:
corr('Preference Rate')

Kendall Tau: 0.9999999999999999
P-value: 0.016666666666666666


Based on those correlation values, we know that the retention rate is highly correlated with customer satisfaction regarding product quality, price, ambience, and service quality.

And because they are satisfied with Starbucks, they will come back to buy the products again and are willing to hang out with friends or hold business meetings at Starbucks.

Only `Service Rate` has a p-value above the significance level of 0.05, so for the other ratings the correlation is unlikely to be due to chance alone. Each test is based on only a few rating levels, however, so we need more data to confirm this. We can still continue our analysis, but take this as a warning that our conclusion may be wrong.
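One way to base the test on more points is to compute Kendall's tau on the individual responses instead of on the grouped means. This is an alternative to the `corr` helper above, not the method used in this notebook, and the sketch below runs on synthetic data invented for illustration, so its numbers say nothing about the survey.

```python
import pandas as pd
from scipy import stats

# Synthetic respondents for illustration only (NOT the survey data):
# a 1-5 rating and a 0/1 "Buying Again" flag per row.
toy = pd.DataFrame({
    "Rating":       [1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "Buying Again": [0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
})

# kendalltau computes tau-b by default, which accounts for the many ties
# that occur when both variables take only a few distinct values.
tau, pval = stats.kendalltau(toy["Rating"], toy["Buying Again"])
print("Kendall tau (row level):", tau)
print("P-value:", pval)
```

On the real data the same call would be `stats.kendalltau(df[col], df['Buying Again'])`, using every respondent rather than one point per rating level.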

#### Who are the people loyal to Starbucks?

In [19]:
loyals = data[data['Buying Again']=='Yes']
loyals.head()

Unnamed: 0,Timestamp,Gender,Age,Occupation,Annual Income,Visit,Behaviour,Time Spend,Distance,Membership,...,Money Spend,Quality Rate,Price Rate,Importance of Promo,Ambience Rate,WIFI Rate,Service Rate,Preference Rate,Promotion Channel,Buying Again
0,2019/10/01 12:38:43 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Dine in,Between 30 minutes to 1 hour,within 1km,Yes,...,Less than RM20,4,3,5,5,4,4,3,Starbucks Website/Apps;Social Media;Emails;Dea...,Yes
1,2019/10/01 12:38:54 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Take away,Below 30 minutes,1km - 3km,Yes,...,Less than RM20,4,3,4,4,4,5,2,Social Media;In Store displays,Yes
2,2019/10/01 12:38:56 PM GMT+8,Male,From 20 to 29,Employed,"Less than RM25,000",Monthly,Dine in,Between 30 minutes to 1 hour,more than 3km,Yes,...,Less than RM20,4,3,4,4,4,4,3,In Store displays;Billboards,Yes
4,2019/10/01 12:39:20 PM GMT+8,Male,From 20 to 29,Student,"Less than RM25,000",Monthly,Take away,Between 30 minutes to 1 hour,1km - 3km,No,...,Around RM20 - RM40,3,3,4,2,2,3,3,Starbucks Website/Apps;Social Media,Yes
5,2019/10/01 12:39:39 PM GMT+8,Female,From 20 to 29,Student,"Less than RM25,000",Rarely,Dine in,Between 30 minutes to 1 hour,more than 3km,No,...,Less than RM20,4,3,5,5,4,5,4,Social Media,Yes


In [20]:
len(loyals)

94

94 out of 121 respondents (about 78%) are willing to buy Starbucks products again.

In [22]:
loyals['Gender'].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
Female,50
Male,44


A slight majority are female (50 of 94).

In [24]:
loyals['Age'].value_counts()

Unnamed: 0_level_0,count
Age,Unnamed: 1_level_1
From 20 to 29,63
From 30 to 39,15
Below 20,10
40 and above,6


Most of them are young adults aged 20 to 29 (63 of 94).

In [25]:
loyals['Occupation'].value_counts()

Unnamed: 0_level_0,count
Occupation,Unnamed: 1_level_1
Employed,49
Student,28
Self-employed,15
Housewife,2


Most of them are already employed.

In [26]:
loyals['Annual Income'].value_counts()

Unnamed: 0_level_0,count
Annual Income,Unnamed: 1_level_1
"Less than RM25,000",53
"RM25,000 - RM50,000",20
"RM50,000 - RM100,000",15
"More than RM150,000",4
"RM100,000 - RM150,000",2


Most of them have a modest income (less than RM25,000 a year).

In [27]:
loyals['Visit'].value_counts()

Unnamed: 0_level_0,count
Visit,Unnamed: 1_level_1
Rarely,54
Monthly,25
Weekly,9
Never,4
Daily,2


They do not visit Starbucks often.

In [28]:
loyals['Behaviour'].value_counts()

Unnamed: 0_level_0,count
Behaviour,Unnamed: 1_level_1
Take away,40
Dine in,34
Drive-thru,17
never,1
I dont like coffee,1
Never,1


They mostly take away (40) or dine in (34).

In [30]:
loyals['Time Spend'].value_counts()

Unnamed: 0_level_0,count
Time Spend,Unnamed: 1_level_1
Below 30 minutes,56
Between 30 minutes to 1 hour,26
Between 1 hour to 2 hours,11
Between 2 hours to 3 hours,1


They don't spend much time in store (perhaps because many of them order take away).

In [31]:
loyals['Membership'].value_counts()

Unnamed: 0_level_0,count
Membership,Unnamed: 1_level_1
Yes,55
No,39


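Members are the majority among loyal customers (55 vs. 39). Raw counts alone cannot show an effect, but the retention rates computed earlier (about 92% for members vs. about 64% for non-members) indicate that membership is associated with loyalty.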
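Counts inside the loyal group depend on how large each group was to begin with, so comparing shares per group is usually more informative. A minimal sketch using `pd.crosstab` with `normalize="index"`; the rows are rebuilt from counts in this notebook (55 and 39 loyal customers, plus 5 and 22 non-loyal ones derived from the group retention rates), and the variable names are illustrative.

```python
import pandas as pd

# Rebuild one row per respondent from the counts in this notebook:
# 55 loyal and 5 non-loyal members, 39 loyal and 22 non-loyal non-members.
rows = (
    [("Yes", "Yes")] * 55 + [("Yes", "No")] * 5
    + [("No", "Yes")] * 39 + [("No", "No")] * 22
)
toy = pd.DataFrame(rows, columns=["Membership", "Buying Again"])

# Share of each answer within every membership group (each row sums to 1)
rates = pd.crosstab(toy["Membership"], toy["Buying Again"], normalize="index")
print(rates.round(3))
```

The same call on `data` works for any other profile column, for example `pd.crosstab(data['Age'], data['Buying Again'], normalize='index')`.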

In [32]:
loyals['Money Spend'].value_counts()

Unnamed: 0_level_0,count
Money Spend,Unnamed: 1_level_1
Around RM20 - RM40,42
Less than RM20,41
More than RM40,7
Zero,4


Most of them spend RM40 or less per visit, split almost evenly between less than RM20 (41) and RM20 - RM40 (42), so they lean toward the cheaper items.

From this analysis, we can deduce that most of the loyal customers are young individuals with relatively low incomes, who may be impulsive buyers and are more inclined to purchase inexpensive products.

#### How will the quality of products and services be enhanced to improve customer retention?

In [33]:
#Products Quality
df.groupby('Quality Rate')[['Buying Again']].mean().reset_index()

Unnamed: 0,Quality Rate,Buying Again
0,1,0.0
1,2,0.25
2,3,0.74359
3,4,0.857143
4,5,0.913043


In [34]:
corr("Quality Rate")

Kendall Tau: 0.9999999999999999
P-value: 0.016666666666666666


In [35]:
#Service Quality
df.groupby('Service Rate')[['Buying Again']].mean().reset_index()

Unnamed: 0,Service Rate,Buying Again
0,1,0.0
1,2,0.5
2,3,0.785714
3,4,0.72549
4,5,0.956522


In [36]:
corr("Service Rate")

Kendall Tau: 0.7999999999999999
P-value: 0.08333333333333333


We already know that higher product quality and service ratings go together with a higher customer retention rate. The more satisfied people are with the products and service, the more willing they are to come back and buy again.

## Conclusion

In conclusion, customer retention can be improved by enhancing customer satisfaction. To do so, we can improve the quality of service and products and improve the in-store ambience, since many of our loyal customers dine in. Also, most of our loyal customers are young adults with modest incomes who may buy impulsively and are willing to pay less.