# Retail: Loyalty Program Analysis Decomposition

<b>Define the goal:</b>
    
* What do you want to achieve and why?
  - We want to find any correlations between churn rate and the customer loyalty program
  - We want to find any correlations between sales numbers and the customer loyalty program

* Who's interested in what you produce?

  - People in charge of the loyalty program
  - Marketing department
* What decisions will be made based on your analysis?
  - Whether to invest more in the loyalty program or not
  - Whether to focus on discounts and membership benefits

<b>Specify details</b>

Task: To determine the probability that a customer will leave based on their behavior.

<b>Propose hypotheses:</b>

For instance, we could hypothesize that customers who are not members of the loyalty program:
* show lower revenue growth than the sample average;
* make purchases less often than average;
* have not bought anything for a long time.

<b>Action plan:</b>

It follows from these hypotheses that we need to:
* Look into the relationship between revenue growth and the probability of churn.
* Identify the relationship between purchase frequency and the probability of churn.
* Compare the time since the last purchase with the probability of churn.

<b>Description of the data</b>

The dataset contains data on purchases made at the building-material retailer Home World. All of its customers have membership cards. Moreover, they can become members of the store's loyalty program for $20 per month. The program includes discounts, information on special offers, and gifts. 

`retail_dataset_us.csv` contains:

- `purchaseid`
- `item_ID`
- `purchasedate`
- `Quantity` — the number of items in the purchase
- `CustomerID`
- `ShopID`
- `loyalty_program` — whether the customer is a member of the loyalty program

`product_codes_us.csv` contains:

- `productID`
- `price_per_one`

# 1. Download the data and read the general information

In [1]:
import pandas as pd

In [2]:
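try:
    # Try local copies first; product_codes_us.csv is semicolon-delimited.
    df = pd.read_csv('retail_dataset_us.csv')
    df_codes = pd.read_csv('product_codes_us.csv', sep=';')
except FileNotFoundError:
    # Fall back to the /datasets directory if the local files are absent.
    df = pd.read_csv('/datasets/retail_dataset_us.csv')
    df_codes = pd.read_csv('/datasets/product_codes_us.csv', sep=';')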

In [3]:
#general info
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105335 entries, 0 to 105334
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   purchaseid       105335 non-null  object 
 1   item_ID          105335 non-null  object 
 2   Quantity         105335 non-null  int64  
 3   purchasedate     105335 non-null  object 
 4   CustomerID       69125 non-null   float64
 5   loyalty_program  105335 non-null  int64  
 6   ShopID           105335 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 5.6+ MB


Unnamed: 0,Quantity,CustomerID,loyalty_program
count,105335.0,69125.0,105335.0
mean,7.821218,21019.302047,0.226345
std,327.946695,1765.444679,0.418467
min,-74216.0,18025.0,0.0
25%,0.0,19544.0,0.0
50%,2.0,20990.0,0.0
75%,7.0,22659.0,0.0
max,74214.0,23962.0,1.0


In [4]:
#preview of the dataset
df.head()

Unnamed: 0,purchaseid,item_ID,Quantity,purchasedate,CustomerID,loyalty_program,ShopID
0,538280,21873,11,2016-12-10 12:50:00,18427.0,0,Shop 3
1,538862,22195,0,2016-12-14 14:11:00,22389.0,1,Shop 2
2,538855,21239,7,2016-12-14 13:50:00,22182.0,1,Shop 3
3,543543,22271,0,2017-02-09 15:33:00,23522.0,1,Shop 28
4,543812,79321,0,2017-02-13 14:40:00,23151.0,1,Shop 28


In [5]:
df.tail()

Unnamed: 0,purchaseid,item_ID,Quantity,purchasedate,CustomerID,loyalty_program,ShopID
105330,538566,21826,1,2016-12-13 11:21:00,,0,Shop 0
105331,540247,21742,0,2017-01-05 15:56:00,21143.0,0,Shop 24
105332,538068,85048,1,2016-12-09 14:05:00,23657.0,1,Shop 16
105333,538207,22818,11,2016-12-10 11:33:00,18427.0,0,Shop 29
105334,543977,22384,9,2017-02-14 15:35:00,21294.0,0,Shop 19


In [6]:
#checking for missing values
df.isnull().sum()

purchaseid             0
item_ID                0
Quantity               0
purchasedate           0
CustomerID         36210
loyalty_program        0
ShopID                 0
dtype: int64

In [7]:
#checking for duplicated rows
df.duplicated().sum()

1033

In [8]:
df_codes.info()
df_codes.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3159 entries, 0 to 3158
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   productID      3159 non-null   object 
 1   price_per_one  3159 non-null   float64
dtypes: float64(1), object(1)
memory usage: 49.5+ KB


Unnamed: 0,price_per_one
count,3159.0
mean,2.954495
std,7.213808
min,0.0
25%,0.65
50%,1.45
75%,3.29
max,175.0


In [9]:
#preview of the dataset
df_codes.head()

Unnamed: 0,productID,price_per_one
0,10002,0.85
1,10080,0.85
2,10120,0.21
3,10123C,0.65
4,10124A,0.42


In [10]:
df_codes.tail()

Unnamed: 0,productID,price_per_one
3154,gift_0001_20,16.67
3155,gift_0001_30,25.0
3156,gift_0001_40,34.04
3157,gift_0001_50,42.55
3158,m,2.55


In [11]:
df_codes.isnull().sum()

productID        0
price_per_one    0
dtype: int64

In [12]:
df_codes.duplicated().sum()

0

# Step 1. Data Preprocessing

Based on our data preview above, here's what we might be able to do:


* Data preprocessing
  - Study missing values
  - Study type correspondence
  - Study duplicate values
  - Check the correctness of column names
  - Rename the columns
  - Remove duplicates
  - Convert types
  - Replace missing values
* There is a significant number of missing values in the 'CustomerID' column (36,210 of 105,335 rows). We need to investigate why they are missing and whether those rows can be salvaged or must be dropped.

* There are 1033 duplicate rows, which we can probably drop. We should also convert columns to appropriate data types. For example, 'CustomerID' is stored as a float although the IDs are whole numbers, and 'purchasedate' is stored as text rather than as a datetime.

* We can try to combine the two datasets by joining 'item_ID' to 'productID', provided both columns identify the same products.

* We can simplify the values in the 'ShopID' column to keep only the number, dropping the word 'Shop'.

* For customers who are members of the loyalty program, we can add a column with the total membership fees paid at $20 per month.
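The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the final pipeline: the snake_case column names, the `preprocess` helper, and the choice to take the median when a product has several listed prices are all assumptions, and the rows in the toy frames are made up.

```python
import pandas as pd


def preprocess(purchases: pd.DataFrame, codes: pd.DataFrame) -> pd.DataFrame:
    """Clean the purchase log and attach a unit price to each row."""
    # Hypothetical snake_case names; the originals mix several styles.
    out = purchases.rename(columns={
        "purchaseid": "purchase_id",
        "item_ID": "item_id",
        "Quantity": "quantity",
        "purchasedate": "purchase_date",
        "CustomerID": "customer_id",
        "ShopID": "shop_id",
    })
    out = out.drop_duplicates().reset_index(drop=True)
    out["purchase_date"] = pd.to_datetime(out["purchase_date"])
    # Nullable integer dtype keeps missing customer IDs as <NA>.
    out["customer_id"] = out["customer_id"].astype("Int64")
    # "Shop 28" -> 28 (assumes every value contains a number).
    out["shop_id"] = out["shop_id"].str.extract(r"(\d+)", expand=False).astype(int)
    # Assumption: if a product has several listed prices, use the median.
    prices = (
        codes.rename(columns={"productID": "item_id"})
        .groupby("item_id", as_index=False)["price_per_one"]
        .median()
    )
    out = out.merge(prices, on="item_id", how="left")
    out["revenue"] = out["quantity"] * out["price_per_one"]
    return out


# Made-up rows that only mimic the shape of the real files.
purchases = pd.DataFrame({
    "purchaseid": ["538280", "538280", "538862"],
    "item_ID": ["21873", "21873", "22195"],
    "Quantity": [11, 11, 2],
    "purchasedate": ["2016-12-10 12:50:00", "2016-12-10 12:50:00",
                     "2016-12-14 14:11:00"],
    "CustomerID": [18427.0, 18427.0, None],
    "loyalty_program": [0, 0, 1],
    "ShopID": ["Shop 3", "Shop 3", "Shop 2"],
})
codes = pd.DataFrame({
    "productID": ["21873", "21873", "22195"],
    "price_per_one": [1.0, 3.0, 0.5],
})

clean = preprocess(purchases, codes)
print(clean)
```

The left merge keeps purchases whose item has no price entry, so unmatched items show up as missing `price_per_one` values instead of silently disappearing.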

# Step 2. Exploratory Data Analysis

Here's what we might be able to do:

* Figure out what time period the purchases cover.
* Investigate the profiles of members and non-members by:
  - calculating what percentage of customers are in the loyalty program and what percentage are not;
  - ranking the stores from highest to lowest sales, separately for members and non-members;
  - checking whether being a member increases the likelihood of purchase;
  - looking at the member and non-member shares of sales for each store;
  - calculating mean sales over time for members and non-members.
* Plot the results above.
* Find the date of the last purchase for each customer.
* Use this date to split the customers into n categories.
* For each category, calculate the share of customers who left.
* Within each category, define extra indicators (e.g. total sum of payments, total number of purchases).
* Draw conclusions: how time since the last purchase relates to these indicators.
* Draw conclusions: how time since the last purchase relates to churn.
* Look into the relationship between revenue growth and the probability of churn.
* Identify the relationship between payment frequency and the probability of churn.
* Compare the time since the last purchase with the probability of churn.
* Perform cohort analysis.
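The recency part of this plan could look something like the sketch below. The dataset has no explicit churn label, so the code assumes a customer has "left" when they have made no purchase in the last `churn_days` days of the observed period; that threshold, the bucket edges, the `customer_summary` helper, and the snake_case column names are all illustrative assumptions, and the toy rows are made up.

```python
import pandas as pd


def customer_summary(df: pd.DataFrame, churn_days: int = 30) -> pd.DataFrame:
    """One row per customer: last purchase, recency, frequency, churn flag."""
    # The latest purchase in the log serves as the reference date.
    snapshot = df["purchase_date"].max()
    known = df.dropna(subset=["customer_id"])
    summary = known.groupby("customer_id").agg(
        last_purchase=("purchase_date", "max"),
        n_purchases=("purchase_id", "nunique"),
        loyalty_program=("loyalty_program", "max"),
    )
    summary["days_since_last"] = (snapshot - summary["last_purchase"]).dt.days
    # Split customers into recency categories (bucket edges are arbitrary).
    summary["recency_bucket"] = pd.cut(
        summary["days_since_last"],
        bins=[-1, 30, 60, float("inf")],
        labels=["0-30", "31-60", "61+"],
    )
    # Assumed churn definition: no purchase within the last `churn_days` days.
    summary["churned"] = summary["days_since_last"] > churn_days
    return summary


# Made-up purchase log with hypothetical column names.
log = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase_id": ["a", "b", "c", "d"],
    "purchase_date": pd.to_datetime(
        ["2016-12-01", "2017-02-20", "2016-12-05", "2017-02-27"]
    ),
    "loyalty_program": [1, 1, 0, 0],
})

summary = customer_summary(log)
# Share of churned customers among non-members (0) and members (1).
churn_by_membership = summary.groupby("loyalty_program")["churned"].mean()
print(summary)
print(churn_by_membership)
```

Because the churn flag is derived from recency, it is more informative to compare churn shares across groups such as members and non-members than across the recency buckets themselves.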


# Step 3. Statistical Data Analysis

Next we can formulate hypotheses to test for differences between members and non-members, such as:

* Customers who are members of the loyalty program are less likely to churn than non-members.
* Customers who are members of the loyalty program spend more money than non-members.
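One possible way to test the churn hypothesis is a two-proportion z-test on the share of churned customers in each group. The sketch below uses only the standard library; the `two_proportion_ztest` helper is hypothetical, the counts are made up for illustration, and the choice of a z-test is an assumption, since the plan does not name a specific test.

```python
from math import erf, sqrt


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two proportions.

    x1, x2 are the numbers of churned customers; n1, n2 are the group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("proportions are all 0 or all 1; the test is undefined")
    z = (p1 - p2) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up counts: 30 of 100 members churned vs 50 of 100 non-members.
z, p_value = two_proportion_ztest(30, 100, 50, 100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value would suggest that the churn shares differ between the groups. The spending hypothesis compares a continuous quantity, so it would need a different test.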


# Step 4. Conclusions

From the results above, we may be able to conclude whether loyalty program members bring in more revenue than non-members, and whether there is a correlation between higher sales and loyalty program membership.