# 1. Identifying the features
#### Use RFM scores for each customer ID for feature set. In order to implement this correctly, we should divide our dataset. We will select a period of 3 months' worth of data, calculate the RFM scores, and then utilize them to predict the following 6 months. Therefore, our first step is to create two dataframes and add the RFM scores to them.'

# 2. Importing necessary libraries and packages

In [17]:
#import libraries
from __future__ import division

from datetime import datetime, timedelta,date
import pandas as pd
%matplotlib inline
# from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# from sklearn.cluster import KMeans


import plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go

# import xgboost as xgb
# from sklearn.model_selection import KFold, cross_val_score, train_test_split

# import xgboost as xgb

In [18]:
#read data from csv and redo the data work we done before
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [19]:
data = pd.read_csv('../data/customer_segmentation.csv', encoding='cp1252')

# 2.1 Feature Engineering

In [20]:
#converting the type of Invoice Date Field from string to datetime.
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

In [21]:
#creating YearMonth field for the ease of reporting and visualization
data['InvoiceYearMonth'] = data['InvoiceDate'].map(lambda date: 100*date.year + date.month)

In [22]:
data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID,InvoiceYearMonth
count,541909.0,541909.0,406829.0,541909.0
mean,9.55225,4.611114,15287.69057,201099.713989
std,218.081158,96.759853,1713.600303,25.788703
min,-80995.0,-11062.06,12346.0,201012.0
25%,1.0,1.25,13953.0,201103.0
50%,3.0,2.08,15152.0,201107.0
75%,10.0,4.13,16791.0,201110.0
max,80995.0,38970.0,18287.0,201112.0


In [23]:
data['Country'].value_counts()

United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon 

#### From here on, our focus will be exclusively on UK data, which contains the highest number of records. To determine the monthly active customers, we will count unique CustomerIDs. The analysis can also be extended to customers from other countries.

In [24]:
data_uk = data.query("Country=='United Kingdom'").reset_index(drop=True)

## Segmentation Techniques:

### To achieve specific goals, various segmentation methods can be employed. For instance, if the aim is to improve retention rate, segmentation based on churn probability can be utilized to take appropriate actions. Additionally, there are common and valuable segmentation methods available. One such method we will implement is RFM, which stands for Recency - Frequency - Monetary Value. The theoretical segments are as follows:

* Low Value: Customers who are less active than others, not very frequent buyer/visitor and generates very low - zero - maybe negative revenue.

* Mid Value: In the middle of everything. Often using our platform (but not as much as our High Values), fairly frequent and generates moderate revenue.

* High Value: The group we don’t want to lose. High Revenue, Frequency and low Inactivity.

#### To implement RFM clustering, we need to calculate Recency, Frequency, and Monetary Value (referred to as Revenue) and employ unsupervised machine learning to identify distinct clusters for each segment. Let's proceed with coding and explore the process of RFM clustering.

# Recency
#### To calculate recency, we need to find out most recent purchase date of each customer and see how many days they are inactive for. After having no. of inactive days for each customer, we will apply K-means* clustering to assign customers a recency score.

In [26]:
#create a generic user dataframe to keep CustomerID and new segmentation scores
data_user = pd.DataFrame(data['CustomerID'].unique())
data_user.columns = ['CustomerID']
data_user.head()

Unnamed: 0,CustomerID
0,17850.0
1,13047.0
2,12583.0
3,13748.0
4,15100.0


In [27]:
data_uk.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceYearMonth
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,201012
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,201012
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,201012
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,201012
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,201012


#### Since we are calculating recency, we need to know when last the person bought something. Let us calculate the last date of transaction for a person.

In [28]:
#get the max purchase date for each customer and create a dataframe with it
data_max_purchase = data_uk.groupby('CustomerID').InvoiceDate.max().reset_index()
data_max_purchase.columns = ['CustomerID','MaxPurchaseDate']
data_max_purchase.head()

Unnamed: 0,CustomerID,MaxPurchaseDate
0,12346.0,2011-01-18 10:17:00
1,12747.0,2011-12-07 14:34:00
2,12748.0,2011-12-09 12:20:00
3,12749.0,2011-12-06 09:56:00
4,12820.0,2011-12-06 15:12:00


In [29]:
# Compare the last transaction of the dataset with last transaction dates of the individual customer IDs.
data_max_purchase['Recency'] = (data_max_purchase['MaxPurchaseDate'].max() - data_max_purchase['MaxPurchaseDate']).dt.days
data_max_purchase.head()

Unnamed: 0,CustomerID,MaxPurchaseDate,Recency
0,12346.0,2011-01-18 10:17:00,325
1,12747.0,2011-12-07 14:34:00,1
2,12748.0,2011-12-09 12:20:00,0
3,12749.0,2011-12-06 09:56:00,3
4,12820.0,2011-12-06 15:12:00,2


In [30]:
#merge this dataframe to our new user dataframe
data_user = pd.merge(data_user, data_max_purchase[['CustomerID','Recency']], on='CustomerID')
data_user.head()

Unnamed: 0,CustomerID,Recency
0,17850.0,301
1,13047.0,31
2,13748.0,95
3,15100.0,329
4,15291.0,25


# 3.1 Assigning a recency score
#### We are going to apply K-means clustering to assign a recency score. But we should tell how many clusters we need to K-means algorithm. To find it out, we will apply Elbow Method. Elbow Method simply tells the optimal cluster number for optimal inertia. Code snippet and Inertia graph are as follows:

In [32]:
from sklearn.cluster import KMeans

sse={} # error
data_recency = data_user[['Recency']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_recency)
    data_recency["clusters"] = kmeans.labels_  #cluster names corresponding to recency values 
    sse[k] = kmeans.inertia_ #sse corresponding to clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

ModuleNotFoundError: No module named 'sklearn'

#### Here it looks like 3 is the optimal one. Based on business requirements, we can go ahead with less or more clusters. We will be selecting 4 for this example

In [33]:
#build 4 clusters for recency and add it to dataframe
kmeans = KMeans(n_clusters=4)
data_user['RecencyCluster'] = kmeans.fit_predict(data_user[['Recency']])

NameError: name 'KMeans' is not defined

In [34]:
data_user.head()

Unnamed: 0,CustomerID,Recency
0,17850.0,301
1,13047.0,31
2,13748.0,95
3,15100.0,329
4,15291.0,25


In [35]:
datauser.groupby('RecencyCluster')['Recency'].describe()

NameError: name 'datauser' is not defined

# 3.2 Ordering clusters

In [36]:
#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

data_user = order_cluster('RecencyCluster', 'Recency',data_user,False)

KeyError: 'RecencyCluster'

In [37]:
data_user.head()

Unnamed: 0,CustomerID,Recency
0,17850.0,301
1,13047.0,31
2,13748.0,95
3,15100.0,329
4,15291.0,25


In [38]:
data_user.groupby('RecencyCluster')['Recency'].describe()

KeyError: 'RecencyCluster'

# 4. Frequency

In [None]:
#get order counts for each user and create a dataframe with it
data_frequency = data_uk.groupby('CustomerID').InvoiceDate.count().reset_index()
data_frequency.columns = ['CustomerID','Frequency']

In [None]:
data_frequency.head() #how many orders does a customer have

In [None]:
#add this data to our main dataframe
data_user = pd.merge(data_user, data_frequency, on='CustomerID')

data_user.head()

# 4.1 Frequency clusters
#### Determine the right number of clusters for K-Means by elbow method

In [39]:
from sklearn.cluster import KMeans

sse={} # error
data_recency = data_user[['Frequency']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_recency)
    data_recency["clusters"] = kmeans.labels_  #cluster names corresponding to recency values 
    sse[k] = kmeans.inertia_ #sse corresponding to clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

ModuleNotFoundError: No module named 'sklearn'

In [None]:
# Applying k-Means
kmeans=KMeans(n_clusters=4)
data_user['FrequencyCluster']=kmeans.fit_predict(data_user[['Frequency']])

#order the frequency cluster
data_user = order_cluster('FrequencyCluster', 'Frequency', data_user, True )
data_user.groupby('FrequencyCluster')['Frequency'].describe()

# 5. Revenue
#### Let’s see how our customer database looks like when we cluster them based on revenue. We will calculate revenue for each customer, plot a histogram and apply the same clustering method.

In [None]:
#calculate revenue for each customer
data_uk['Revenue'] = data_uk['UnitPrice'] * data_uk['Quantity']
data_revenue = data_uk.groupby('CustomerID').Revenue.sum().reset_index()

In [None]:
data_revenue.head()

In [None]:
#merge it with our main dataframe
data_user = pd.merge(data_user, data_revenue, on='CustomerID')
data_user.head()

In [None]:
from sklearn.cluster import KMeans

sse={} # error
data_recency = data_user[['Revenue']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_recency)
    data_recency["clusters"] = kmeans.labels_  #cluster names corresponding to recency values 
    sse[k] = kmeans.inertia_ #sse corresponding to clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

# 5.1 Revenue clusters

In [None]:
#apply clustering
kmeans = KMeans(n_clusters=4)
data_user['RevenueCluster'] = kmeans.fit_predict(data_user[['Revenue']])

#order the cluster numbers
data_user = order_cluster('RevenueCluster', 'Revenue',data_user,True)

#show details of the dataframe
data_user.groupby('RevenueCluster')['Revenue'].describe()

# Overall score based on RFM clustering
#### We have scores (cluster numbers) for recency, frequency & revenue. Let’s create an overall score out of them

In [None]:
#calculate overall score and use mean() to see details
data_user['OverallScore'] = data_user['RecencyCluster'] + data_user['FrequencyCluster'] + data_user['RevenueCluster']
data_user.groupby('OverallScore')['Recency','Frequency','Revenue'].mean()

#### Score 8 is our best customer, score 0 is our worst customer.

In [None]:
data_user['Segment'] = 'Low-Value'
data_user.loc[data_user['OverallScore']>2,'Segment'] = 'Mid-Value' 
data_user.loc[data_user['OverallScore']>4,'Segment'] = 'High-Value' 

In [None]:
data_user

# Customer Lifetime Value
#### Since our feature set is ready, let’s calculate 6 months LTV for each customer which we are going to use for training our model.
### Lifetime Value: Total Gross Revenue - Total Cost
#### There is no cost specified in the dataset. That’s why Revenue becomes our LTV directly.

In [None]:
data_uk.head()

In [None]:
data_uk['InvoiceDate'].describe()

#### We see that customers are active from 1 December 2010. Let us consider customers from March onwards (so that they are not new customers). We shall divide them into 2 subgroups. One will be where timeframe of analysing is 3 months, another will be timeframe of 6 months.

In [None]:
data_3m = data_uk[(data_uk.InvoiceDate < date(2011,6,1)) & (data_uk.InvoiceDate >= date(2011,3,1))].reset_index(drop=True) #3 months time
data_6m = data_uk[(data_uk.InvoiceDate >= date(2011,6,1)) & (data_uk.InvoiceDate < date(2011,12,1))].reset_index(drop=True) # 6 months time

In [None]:
#calculate revenue and create a new dataframe for it
data_6m['Revenue'] = data_6m['UnitPrice'] * data_6m['Quantity']
data_user_6m = data_6m.groupby('CustomerID')['Revenue'].sum().reset_index()
data_user_6m.columns = ['CustomerID','m6_Revenue']

In [None]:
data_user_6m.head()

In [None]:
#plot LTV histogram
plot_data = [
    go.Histogram(
        x=data_user_6m['m6_Revenue']
    )
]

plot_layout = go.Layout(
        title='6m Revenue'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

#### Histogram clearly shows we have customers with negative LTV. We have some outliers too. Filtering out the outliers makes sense to have a proper machine learning model.

In [None]:
data_user.head()

In [None]:
data_uk.head()

In [None]:
data_merge = pd.merge(data_user, data_user_6m, on='CustomerID', how='left') #Only people who are in the timeline of data_user_6m


In [None]:
data_merge = data_merge.fillna(0)

In [None]:
data_graph = data_merge.query("m6_Revenue < 50000") #because max values are ending at 50,000 as seen in graph above

plot_data = [
    go.Scatter(
        x=data_graph.query("Segment == 'Low-Value'")['OverallScore'],
        y=data_graph.query("Segment == 'Low-Value'")['m6_Revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=data_graph.query("Segment == 'Mid-Value'")['OverallScore'],
        y=data_graph.query("Segment == 'Mid-Value'")['m6_Revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=data_graph.query("Segment == 'High-Value'")['OverallScore'],
        y=data_graph.query("Segment == 'High-Value'")['m6_Revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "6m LTV"},
        xaxis= {'title': "RFM Score"},
        title='LTV'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

#### We can visualise correlation between overall RFM score and revenue. Positive correlation is quite visible here. High RFM score means high LTV.

#### Before building the machine learning model, we need to identify what is the type of this machine learning problem. LTV itself is a regression problem. A machine learning model can predict the dollar value of the LTV. But here, we want LTV segments. #### #### Because it makes it more actionable and easy to communicate with other people. By applying K-means clustering, we can identify our existing LTV groups and build segments on top of it.

#### Considering business part of this analysis, we need to treat customers differently based on their predicted LTV. For this example, we will apply clustering and have 3 segments (number of segments really depends on your business dynamics and goals):

* Low LTV
* Mid LTV
* High LTV

#### We are going to apply K-means clustering to decide segments and observe their characteristics

In [None]:
#remove outliers
data_merge = data_merge[data_merge['m6_Revenue']<data_merge['m6_Revenue'].quantile(0.99)]

In [None]:
data_merge.head()

In [None]:
#creating 3 clusters
kmeans = KMeans(n_clusters=3)
data_merge['LTVCluster'] = kmeans.fit_predict(data_merge[['m6_Revenue']])

data_merge.head()

In [None]:
#order cluster number based on LTV
data_merge = order_cluster('LTVCluster', 'm6_Revenue',data_merge,True)

#creatinga new cluster dataframe
data_cluster = data_merge.copy()

#see details of the clusters
data_cluster.groupby('LTVCluster')['m6_Revenue'].describe()

#### We have finished LTV clustering and here are the characteristics of each clusters as shown above.

#### Cluster 2 is the best with average 8.2k LTV whereas 0 is the worst with 396.

#### There are few more step before training the machine learning model:
* Feature engineering.
* Convert categorical columns to numerical columns.
* We will check the correlation of features against our label, LTV clusters.
* We will split our feature set and label (LTV) as X and y. We use X to predict y.
* Will create Training and Test dataset. Training set will be used for building the machine learning model. We will apply our model to Test set to see its real performance.

In [None]:
data_cluster.head()

# 7.1 Feature Engineering

In [None]:
#convert categorical columns to numerical
data_class = pd.get_dummies(data_cluster) #There is only one categorical variable segment
data_class.head()

In [None]:
#calculate and show correlations
corr_matrix = data_class.corr()
corr_matrix['LTVCluster'].sort_values(ascending=False)

In [None]:
#create X and y, X will be feature set and y is the label - LTV
X = data_class.drop(['LTVCluster','m6_Revenue'],axis=1)
y = data_class['LTVCluster']

#split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)

#### We see that Revenue, Frequency and RFM scores will be helpful for our machine learning models from the correlation with LTV Cluster.

# 8. Machine Learning Model for Customer Lifetime Value Prediction
#### Since our LTV Clusters are 3 types, high LTV, mid LTV, and low LTV; we will perform multi class classification

In [None]:
#XGBoost Multiclassification Model
ltv_xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1,n_jobs=-1).fit(X_train, y_train)

print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(ltv_xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(ltv_xgb_model.score(X_test[X_train.columns], y_test)))

y_pred = ltv_xgb_model.predict(X_test)

#### Accuracy looks good on training and test set. Let's check the precision, recall, fscore too

In [None]:
print(classification_report(y_test, y_pred))

# Final Clusters for Customer Lifetime Value

* Cluster 0: Good precision, recall, f1-score and support
* Cluster 1: Needs better precision, recall and f1-score
* Cluster 2: Bad precision, F1-Score needs improvement

#### If model tells us this customer belongs to cluster 0, 93 out of 100 will be correct (precision). And the model successfully identifies 95% of actual cluster 0 customers (recall).

#### We really need to improve the model for other clusters. For example, we barely detect 67% of Mid LTV customers.

#### Possible actions to improve performance

* Adding more features and improve feature engineering
* Try different models other than XGBoost
* Apply hyper parameter tuning to current model
* Add more data to the model if possible