# The Business Problem

FLO is a shoe retail chain operating in Turkiye that sells many kinds and brands of shoes for all ages and genders. FLO wants to establish a roadmap for its sales and marketing activities. To plan for the medium to long term, the company needs to estimate the potential value that existing customers will provide in the future.

# The Dataset

The dataset consists of information derived from the past shopping behavior of *OmniChannel customers (those who shop through both online and offline channels)* who made their last purchases in 2020-2021.

* **master_id**: Unique customer ID.
* **order_channel**: The channel the customer used for shopping (Android, ios, Desktop, Mobile, Offline).
* **last_order_channel**: The channel the customer used for their most recent purchase.
* **first_order_date**: The date of the customer's first purchase.
* **last_order_date**: The date of the customer's last purchase.
* **last_order_date_online**: The date of the customer's last purchase through online channels.
* **last_order_date_offline**: The date of the customer's last purchase through offline channels.
* **order_num_total_ever_online**: The total number of purchases made by the customer using online channels.
* **order_num_total_ever_offline**: The total number of purchases made by the customer using offline channels.
* **customer_value_total_ever_offline**: The total amount paid by the customer for offline purchases.
* **customer_value_total_ever_online**: The total amount paid by the customer for online purchases.
* **interested_in_categories_12**: The list of categories in which the customer has shopped in the last 12 months.


**Table of Contents**

1. [Setting Up the Environment](#Section-one)
2. [Exploratory Data Analysis](#Section-two)
3. [Data Preparation](#Section-three)
4. [Preparation of CLTV Data Structure](#Section-four)
5. [Setting Up BG-NBD and Gamma-Gamma Models](#Section-five)
6. [Segmentation of Customers with CLTV](#Section-six)
7. [Functionalization of the Process](#Section-seven)
    

<a id="Section-one" ></a>
# Setting Up the Environment

In [None]:
!pip install lifetimes
import pandas as pd
import datetime as dt
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("/kaggle/input/flo-data-20k/flo_data_20k.csv")
df_ = df.copy()

<a id="Section-two" ></a>
# Exploratory Data Analysis

In [None]:
def check_df(dataframe):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(3))
    print("##################### Tail #####################")
    print(dataframe.tail(3))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # Quantiles are only defined for numeric columns; skip object/date columns
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T

<a id="Section-three" ></a>
# Data Preparation

**Processing Outliers**

There are outliers in the dataset. To suppress only the most extreme values, I used an IQR-style rule with custom quantile thresholds of 0.01 and 0.99 instead of the usual 0.25 and 0.75.

In [None]:
def outlier_thresholds(dataframe, variable):
    # Use the 1st and 99th percentiles in place of the usual quartiles
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    # Limits sit 1.5 ranges beyond the chosen percentiles
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

In [None]:
def replace_with_thresholds(dataframe, variable):
    # Cap values outside the limits at the (rounded) limits; modifies dataframe in place
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = round(low_limit, 0)
    dataframe.loc[(dataframe[variable] > up_limit), variable] = round(up_limit, 0)

In [None]:
columns = ["order_num_total_ever_online", "order_num_total_ever_offline", "customer_value_total_ever_offline","customer_value_total_ever_online"]
df[columns].describe().T

In [None]:
for col in columns:
    replace_with_thresholds(df, col)

Checking whether outliers persist:

In [None]:
df[columns].describe().T

Some extreme values remain, but since the 0.01/0.99 limits are deliberately wide, this seems reasonable for this case.
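
To illustrate why the 0.01/0.99 limits suppress so little, here is a minimal sketch on hypothetical toy data (the `toy` frame and the `iqr_limits` helper are invented for this example and are not part of the notebook). It compares the limits from the usual quartiles with those from the wider quantiles and then caps the column at the wider limits:

```python
import pandas as pd

# Hypothetical toy column: 99 ordinary values plus one extreme value.
toy = pd.DataFrame({"spend": list(range(1, 100)) + [10_000]})


def iqr_limits(series, q_low, q_high):
    """Return (low, up) limits placed 1.5 ranges beyond two chosen quantiles."""
    low_q, high_q = series.quantile(q_low), series.quantile(q_high)
    spread = high_q - low_q
    return low_q - 1.5 * spread, high_q + 1.5 * spread


classic = iqr_limits(toy["spend"], 0.25, 0.75)  # conventional quartile rule
loose = iqr_limits(toy["spend"], 0.01, 0.99)    # thresholds used in this notebook

# Cap the column at the looser limits; only the extreme value is affected.
capped = toy["spend"].clip(lower=loose[0], upper=loose[1])

print("classic limits:", classic)
print("loose limits:  ", loose)
print("max before/after capping:", toy["spend"].max(), capped.max())
```

The wider the quantiles, the further out the limits sit, so fewer observations are treated as outliers.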

**Combining Online & Offline Variables**

***Omnichannel customers*** are customers who shop using both online and offline platforms. To analyze omnichannel behavior, I created new variables for each customer's total number of purchases and total spending across online and offline channels:

In [None]:
df["order_num_total"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
df["customer_value_total"] = df["customer_value_total_ever_offline"] + df["customer_value_total_ever_online"]


Eliminating customers that do not provide any value (have 0 monetary or frequency scores):

In [None]:
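# Negate the whole condition: `~` binds tighter than `|`, so the extra parentheses are required
df = df[~((df["customer_value_total"] == 0) | (df["order_num_total"] == 0))].copy()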
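
Operator precedence matters when combining boolean masks in pandas: `~` binds more tightly than `|`, so negating a compound condition needs its own parentheses. A small sketch on hypothetical toy data (the `toy` frame is invented for illustration) shows the difference:

```python
import pandas as pd

# Hypothetical customers: only the first one has both spend and orders.
toy = pd.DataFrame({
    "customer_value_total": [100.0, 0.0, 50.0],
    "order_num_total": [2, 0, 0],
})

zero_value = toy["customer_value_total"] == 0
zero_orders = toy["order_num_total"] == 0

# Only the first condition is negated: rows with zero orders are still kept.
partly_negated = toy[~zero_value | zero_orders]

# The whole disjunction is negated: any row with a zero in either column is dropped.
fully_negated = toy[~(zero_value | zero_orders)]

print(len(partly_negated), len(fully_negated))
```

Only the fully negated form removes every customer with zero spend or zero orders, which matters later because the average order value divides by the order count.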

Converting date variables from object to datetime:

In [None]:
date_columns = df.columns[df.columns.str.contains("date")]
df[date_columns] = df[date_columns].apply(pd.to_datetime)

<a id="Section-four" ></a>
# Preparation of CLTV Data Structure

In [None]:
df["last_order_date"].max() # 2021-05-30
analysis_date = dt.datetime(2021,6,1)

In [None]:
cltv_df = pd.DataFrame()
cltv_df["customer_id"] = df["master_id"]
cltv_df["recency_cltv_weekly"] = (df["last_order_date"] - df["first_order_date"]).dt.days / 7
cltv_df["T_weekly"] = (analysis_date - df["first_order_date"]).dt.days / 7
cltv_df["frequency"] = df["order_num_total"]
cltv_df["monetary_cltv_avg"] = df["customer_value_total"] / df["order_num_total"]

cltv_df.describe().T
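
As a worked example of these definitions, the sketch below computes the same quantities for a single hypothetical customer (all dates and amounts are invented for illustration). Recency is the time between the first and last purchase, `T` is the customer's age at the analysis date, and the monetary value is the average spend per order:

```python
import datetime as dt

# Hypothetical customer, invented for illustration only.
first_order = dt.datetime(2020, 1, 1)
last_order = dt.datetime(2021, 1, 6)
analysis_date = dt.datetime(2021, 6, 1)  # same analysis date as in the notebook
total_orders = 4
total_spend = 600.0

# Weeks between first and last purchase (customer-specific recency).
recency_weeks = (last_order - first_order).days / 7
# Weeks between first purchase and the analysis date (customer age, T).
tenure_weeks = (analysis_date - first_order).days / 7
# Average spend per order.
avg_order_value = total_spend / total_orders

print(recency_weeks, round(tenure_weeks, 2), avg_order_value)
```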

<a id="Section-five" ></a>
# Setting Up BG-NBD and Gamma-Gamma Models

**BG-NBD**

Fitting the BG-NBD model, which is used to estimate each customer's expected number of transactions (recency and customer age are measured in weeks):

In [None]:
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(cltv_df['frequency'],
        cltv_df['recency_cltv_weekly'],
        cltv_df['T_weekly'])

#a: 0.00, alpha: 76.17, b: 0.00, r: 3.66
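
For reference, the four fitted parameters reported above correspond to the population-level distributions that the BG-NBD model assumes. The block below states those assumptions in the model's standard notation; it is background on the model, not output of this notebook:

```latex
\begin{aligned}
X(t) \mid \lambda &\sim \text{Poisson}(\lambda t) && \text{purchases while the customer is active} \\
\lambda &\sim \text{Gamma}(r, \alpha) && \text{heterogeneity in purchase rates across customers} \\
p &\sim \text{Beta}(a, b) && \text{probability of dropping out after any purchase}
\end{aligned}
```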

In [None]:
plot_period_transactions(bgf)

Predicting each customer's expected number of purchases within the next 3 months (12 weeks):

In [None]:
cltv_df["exp_sales_3_month"] = bgf.predict(4*3,
                                       cltv_df['frequency'],
                                       cltv_df['recency_cltv_weekly'],
                                       cltv_df['T_weekly'])

cltv_df.sort_values("exp_sales_3_month",ascending=False)[:10]

Predicting each customer's expected number of purchases within the next 6 months (24 weeks):

In [None]:
cltv_df["exp_sales_6_month"] = bgf.predict(4*6,
                                       cltv_df['frequency'],
                                       cltv_df['recency_cltv_weekly'],
                                       cltv_df['T_weekly'])

cltv_df.sort_values("exp_sales_6_month",ascending=False)[:10]

**Gamma-Gamma model**

The Gamma-Gamma model predicts the average monetary value per transaction that each customer is expected to generate.

In [None]:
ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cltv_df['frequency'],
        cltv_df['monetary_cltv_avg'])

cltv_df["exp_average_value"] = ggf.conditional_expected_average_profit(cltv_df['frequency'],
                                                                cltv_df['monetary_cltv_avg'])

cltv_df.head()
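
The Gamma-Gamma conditional expectation can be read as a weighted average of the customer's own observed average and the population mean, with more weight on the customer's own history as the number of transactions grows. The sketch below writes out that form with hypothetical parameter values `p`, `q`, and `v` (chosen only for illustration, not the values fitted above); the `expected_average_value` helper is likewise invented for this example:

```python
def expected_average_value(frequency, observed_avg, p, q, v):
    """Weighted-average form of the Gamma-Gamma conditional expectation (assumes q > 1)."""
    population_mean = v * p / (q - 1)
    # Weight on the customer's own average grows with the number of transactions.
    weight = p * frequency / (p * frequency + q - 1)
    return (1 - weight) * population_mean + weight * observed_avg


# Hypothetical parameters: the implied population mean is 50 * 4 / (3 - 1) = 100.
p, q, v = 4.0, 3.0, 50.0

# Two hypothetical customers with the same observed average but different histories.
few_orders = expected_average_value(2, 200.0, p, q, v)
many_orders = expected_average_value(50, 200.0, p, q, v)

print(few_orders, many_orders)
```

A customer with few transactions is pulled toward the population mean, while a customer with a long history keeps an estimate close to their own observed average.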

Calculating the 6-month CLTV:

In [None]:
cltv = ggf.customer_lifetime_value(bgf,
                                   cltv_df['frequency'],
                                   cltv_df['recency_cltv_weekly'],
                                   cltv_df['T_weekly'],
                                   cltv_df['monetary_cltv_avg'],
                                   time=6,  # prediction horizon in months
                                   freq="W",  # time unit of recency and T (weeks)
                                   discount_rate=0.01)  # monthly discount rate applied to future value
cltv_df["cltv"] = cltv

cltv_df.head()
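
The `discount_rate` argument reduces the weight of value expected in later months. The sketch below is a simplified illustration of that discounting idea, using hypothetical monthly revenue figures rather than model output. It is not the library's implementation, and the `discounted_value` helper is invented for this example:

```python
def discounted_value(monthly_revenue, discount_rate=0.01):
    """Sum monthly revenue, discounting month i by (1 + rate) ** i."""
    return sum(
        revenue / (1 + discount_rate) ** month
        for month, revenue in enumerate(monthly_revenue, start=1)
    )


# Hypothetical expected revenue for each of the next 6 months.
expected_revenue = [100.0] * 6

undiscounted = discounted_value(expected_revenue, discount_rate=0.0)
discounted = discounted_value(expected_revenue, discount_rate=0.01)

print(undiscounted, round(discounted, 2))
```

With a positive rate the 6-month total is somewhat lower than the plain sum, because revenue expected further in the future counts for less.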

<a id="Section-six" ></a>
# Segmentation of Customers with CLTV

In [None]:
cltv_df["cltv_segment"] = pd.qcut(cltv_df["cltv"], 4, labels=["D", "C", "B", "A"])
cltv_df.head()
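
`pd.qcut` splits the values into equal-sized groups by quantile, and the labels are assigned from the lowest group to the highest, so "A" marks the highest-CLTV quartile. A minimal sketch on hypothetical values (the `toy_cltv` series is invented for illustration):

```python
import pandas as pd

# Hypothetical CLTV values, already sorted for readability.
toy_cltv = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile-based groups; labels run from lowest ("D") to highest ("A").
segments = pd.qcut(toy_cltv, 4, labels=["D", "C", "B", "A"])

print(pd.DataFrame({"cltv": toy_cltv, "segment": segments}))
```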

Comparison of customer segments:

In [None]:
cltv_df.groupby("cltv_segment").agg({"exp_average_value":"mean",
                                     "cltv":"mean",
                                     "exp_sales_6_month":"mean",
                                     "frequency":"mean",
                                     "monetary_cltv_avg":"mean",
                                     "T_weekly":"mean",
                                     "recency_cltv_weekly":"mean"})

<a id="Section-seven" ></a>
# Functionalization of the Process

In [None]:
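def create_cltv_df(dataframe):
    """Prepare the data, fit the BG-NBD and Gamma-Gamma models, and return segmented CLTV predictions."""
    # Work on a copy so the caller's dataframe is not modified
    dataframe = dataframe.copy()

    # Data Preparation
    columns = ["order_num_total_ever_online", "order_num_total_ever_offline", "customer_value_total_ever_offline","customer_value_total_ever_online"]
    for col in columns:
        replace_with_thresholds(dataframe, col)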

    dataframe["order_num_total"] = dataframe["order_num_total_ever_online"] + dataframe["order_num_total_ever_offline"]
    dataframe["customer_value_total"] = dataframe["customer_value_total_ever_offline"] + dataframe["customer_value_total_ever_online"]
    # Drop customers with zero spend or zero orders (the whole condition is negated)
    dataframe = dataframe[~((dataframe["customer_value_total"] == 0) | (dataframe["order_num_total"] == 0))].copy()
    date_columns = dataframe.columns[dataframe.columns.str.contains("date")]
    dataframe[date_columns] = dataframe[date_columns].apply(pd.to_datetime)

    # Preparation of CLTV data structure
    # The latest order in the data is dated 2021-05-30; the analysis date is set two days later
    analysis_date = dt.datetime(2021, 6, 1)
    cltv_df = pd.DataFrame()
    cltv_df["customer_id"] = dataframe["master_id"]
    # .dt.days extracts whole days; casting to 'timedelta64[D]' is not supported in recent pandas versions
    cltv_df["recency_cltv_weekly"] = (dataframe["last_order_date"] - dataframe["first_order_date"]).dt.days / 7
    cltv_df["T_weekly"] = (analysis_date - dataframe["first_order_date"]).dt.days / 7
    cltv_df["frequency"] = dataframe["order_num_total"]
    cltv_df["monetary_cltv_avg"] = dataframe["customer_value_total"] / dataframe["order_num_total"]
    cltv_df = cltv_df[(cltv_df['frequency'] > 1)]

    # Setting up BG-NBD Model
    bgf = BetaGeoFitter(penalizer_coef=0.001)
    bgf.fit(cltv_df['frequency'],
            cltv_df['recency_cltv_weekly'],
            cltv_df['T_weekly'])
    cltv_df["exp_sales_3_month"] = bgf.predict(4 * 3,
                                               cltv_df['frequency'],
                                               cltv_df['recency_cltv_weekly'],
                                               cltv_df['T_weekly'])
    cltv_df["exp_sales_6_month"] = bgf.predict(4 * 6,
                                               cltv_df['frequency'],
                                               cltv_df['recency_cltv_weekly'],
                                               cltv_df['T_weekly'])

    # Setting Up Gamma-Gamma Model
    ggf = GammaGammaFitter(penalizer_coef=0.01)
    ggf.fit(cltv_df['frequency'], cltv_df['monetary_cltv_avg'])
    cltv_df["exp_average_value"] = ggf.conditional_expected_average_profit(cltv_df['frequency'],
                                                                           cltv_df['monetary_cltv_avg'])

    # CLTV Prediction
    cltv = ggf.customer_lifetime_value(bgf,
                                       cltv_df['frequency'],
                                       cltv_df['recency_cltv_weekly'],
                                       cltv_df['T_weekly'],
                                       cltv_df['monetary_cltv_avg'],
                                       time=6,
                                       freq="W",
                                       discount_rate=0.01)
    cltv_df["cltv"] = cltv

    # CLTV Segmentation
    cltv_df["cltv_segment"] = pd.qcut(cltv_df["cltv"], 4, labels=["D", "C", "B", "A"])

    return cltv_df
