In [1]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

# move directory
import os
colab_dir = "./drive/MyDrive/"
os.chdir(colab_dir)

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib_inline
%matplotlib inline

# set random seed
import random
random.seed(335)

# magic word
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# for better viz
import pprint
import warnings
warnings.filterwarnings('ignore')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# read excel file
df = pd.read_excel("Colab Notebooks/[DM] s1/Online Retail.xlsx")

# Remove records without a CustomerID
df = df[~(df.CustomerID.isnull())]

# Remove negative or return transactions
df = df[~(df.Quantity<0)]
df = df[df.UnitPrice>0]

# transformation to the necessary datatypes
df.InvoiceDate = pd.to_datetime(df.InvoiceDate)
df.CustomerID = df.CustomerID.astype('Int64')

# Amount
df['amount'] = df.Quantity*df.UnitPrice

# Days since Last Purchase
import datetime
reference_date = df.InvoiceDate.max() + datetime.timedelta(days=1)
# .dt.days returns whole days; astype('timedelta64[D]') is not supported by recent pandas versions
df['days_since_last_purchase'] = (reference_date - df.InvoiceDate).dt.days

# Recency
customer_history_df = df[['CustomerID', 'days_since_last_purchase']].groupby("CustomerID").min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase':'recency'}, inplace=True)

# Frequency
# number of distinct invoices per customer
customer_freq = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
customer_freq.rename(columns={'InvoiceNo':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq)

# Monetary Value
customer_monetary_val = df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val)

import math
from sklearn import preprocessing

# transform the variables on the log scale 
# (solves the problem with a huge range of values)
customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['amount_log'] = customer_history_df['amount'].apply(math.log)

# standardization (necessary for K-means)
feature_vector = ['amount_log', 'recency_log','frequency_log']
X_subset = customer_history_df[feature_vector]
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)

In [3]:
# build and train the model with the optimal parameters
from sklearn.cluster import KMeans

features = ['amount',  'recency',  'frequency']

clusterer = KMeans(
    n_clusters=7,
    init='k-means++', 
    random_state=101
)

cluster_labels = clusterer.fit_predict(X_scaled)

# evaluation
---------------------
At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

## evaluate result
----------

### task

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option of evaluation is to test the model(s) on test applications in the real application if time and budget constraints permit.

Moreover, evaluation also assesses other data mining results generated. Data mining results cover models which are necessarily related to the original business objectives and all other findings which are not necessarily related to the original business objectives but might also unveil additional challenges, information or hints for future directions.

### output

#### assessment of data mining results with respect to business success criteria 

Summarize assessment results in terms of business success criteria including a final statement whether the project already meets the initial business objectives.

#### approved models 

After model assessment with respect to business success criteria, the generated models that meet the selected criteria become approved models.


### Assessment of Data Mining Results with Respect to Business Success Criteria

In [4]:
print("Centers of each cluster:")
cent_transformed = scaler.inverse_transform(clusterer.cluster_centers_)
print(pd.DataFrame(np.exp(cent_transformed),columns=features))

Centers of each cluster:
         amount     recency  frequency
0    205.257565  225.386840   1.083954
1   2447.431580   38.058360   6.037971
2    810.639899  107.621358   2.278104
3    678.481625   13.838593   2.689984
4   2123.643873    4.419742   6.441601
5    240.091620   36.207714   1.134226
6  10213.306074    4.904358  20.795906
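
The centers printed above are obtained by undoing the standardization and then the log transform. The following is a minimal sketch of that round trip on made-up spend values (not the retail data; all names are illustrative). It suggests that a back-transformed center corresponds to the geometric mean of the cluster members, which for right-skewed amounts sits below the arithmetic mean.

```python
import numpy as np

# Hypothetical spend values for all customers (not the retail data).
all_amounts = np.array([20.0, 35.0, 50.0, 80.0, 120.0, 400.0, 2500.0, 9000.0])

# Forward: log transform, then standardize each column.
# StandardScaler uses the population standard deviation (ddof=0), like np.std.
logs = np.log(all_amounts)
mu, sigma = logs.mean(), logs.std()
z = (logs - mu) / sigma

# Assume the last four customers form one cluster (illustrative assignment).
members = np.array([False] * 4 + [True] * 4)

# A K-means center is the mean of its members in the scaled space.
center_z = z[members].mean()

# Backward: undo the standardization, then undo the log.
center_amount = np.exp(center_z * sigma + mu)

geometric_mean = np.exp(np.log(all_amounts[members]).mean())
arithmetic_mean = all_amounts[members].mean()
print(center_amount, geometric_mean, arithmetic_mean)
```

When reading the table, the values are therefore better treated as "typical" customers of each cluster than as plain averages.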


The model segments the customers into 7 types, each characterized by its Amount, Recency, and Frequency values. Based on these results, the following suggestions can be made for increasing income:

- Cluster 6 clearly contains the best customers: they shop often, have bought recently, and spend the most.
- Clusters 1 and 4 show good spending and good frequency, differing only in how recent their last purchases were. Cluster 1's last purchase is older (about 38 days ago versus about 4), which suggests actively selling to cluster 1 as soon as possible and raising its frequency.
- Cluster 2 has the fourth-highest spending and a reasonable frequency, but has gone a long time without buying (about 108 days). This group should be responsive to promotions and reactivation campaigns, so that these customers are not lost and make their next purchase.
- Cluster 3 is similar to cluster 2, but has purchased more recently and has a slightly higher frequency. Action should be taken to raise its frequency and reduce the chance of these customers migrating to cluster 2 by going longer without a purchase.
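
The suggestions above rest on the cluster centers alone. A per-cluster profile with sizes and revenue shares can help judge how much each segment matters to the business. This is a minimal sketch on made-up data; `profile_clusters` and the demo frame are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

def profile_clusters(rfm: pd.DataFrame, labels) -> pd.DataFrame:
    """Summarize each cluster: size, median RFM values, and share of total revenue."""
    labelled = rfm.assign(cluster=list(labels))
    profile = labelled.groupby("cluster").agg(
        customers=("amount", "size"),
        median_amount=("amount", "median"),
        median_recency=("recency", "median"),
        median_frequency=("frequency", "median"),
        total_amount=("amount", "sum"),
    )
    profile["revenue_share"] = profile["total_amount"] / profile["total_amount"].sum()
    # Most valuable segments first.
    return profile.sort_values("revenue_share", ascending=False)

# Tiny made-up example: two low-value and two high-value customers.
demo = pd.DataFrame({
    "amount": [100.0, 200.0, 5000.0, 7000.0],
    "recency": [200, 150, 3, 5],
    "frequency": [1, 1, 20, 25],
})
demo_profile = profile_clusters(demo, [0, 0, 1, 1])
print(demo_profile)
```

In this notebook the call would presumably be `profile_clusters(customer_history_df, cluster_labels)`, since `customer_history_df` already carries the `amount`, `recency`, and `frequency` columns.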

### Approved Models
The K-means model (with k-means++ initialization) and K = 7 is approved. It allows us to give advice that is valuable to the business.

## review process
------------------

### task

At this point the resultant model hopefully appears to be satisfactory and to satisfy business needs. It is now appropriate to do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues, e.g., did we correctly build the model? Did we only use attributes that we are allowed to use and that are available for future analyses?

### output

Summarize the process review and highlight activities that have been missed and/or should be repeated. 


### Review of the Data Mining Process

* Only legally allowed data has been used
* Goals and objectives have been set
* The problems with the data have been detected and eliminated
* Alternative clustering models are available
* Not all possible parameters of the chosen clustering model have been tuned
* There are other methods for choosing the optimal value of K for the K-means model
* Data mining goals have been met
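
One of the other methods for choosing K mentioned in the review is the silhouette score. The sketch below scans several values of K on synthetic blobs (not the retail data). `silhouette_by_k` is a hypothetical helper, and it is not known which selection method the notebook originally used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=101):
    """Fit K-means for each K and return {K: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Synthetic stand-in for X_scaled: three well-separated blobs in 3 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

demo_scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(demo_scores, key=demo_scores.get)
print(best_k, demo_scores)
```

On real data the score rarely peaks as sharply as on such blobs, so it is usually read together with the business interpretability of the resulting segments.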

## determine next steps
----------

### task

According to the assessment results and the process review, the project decides how to proceed at this stage. The project needs to decide whether to finish this project and move on to deployment if appropriate or whether to initiate further iterations or set up new data mining projects. This task includes analyses of the remaining resources and budget, which influence the decision.

### output

#### list of possible actions 

List the potential further actions along with the reasons for and against each option.

#### Decision 

Describe the decision as to how to proceed along with the rationale.


### List of possible actions
* Sell to group 1 as soon as possible and raise its frequency  
For: this will bring in more money, since it is a high-spending group  
Against: none identified
* Target group 2 with promotions and reactivation campaigns  
For: these customers are not lost and make their next purchase  
Against: none identified
* Raise the frequency of group 3  
For: reduces the chance of these customers migrating to cluster 2 by going longer without a purchase  
Against: none identified

### Decision

The data mining project is complete. It now makes sense to deploy the model in order to identify the group to which each customer belongs. This will make it possible to interact with customers individually and increase income.
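
A deployed model would need to apply the same preprocessing as training before predicting a segment: log transform, then the fitted scaler, then the fitted K-means. The sketch below illustrates this on synthetic data. `assign_segment`, `demo_scaler`, and `demo_clusterer` are hypothetical names, and the two-cluster setup is only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def assign_segment(amount, recency, frequency, scaler, clusterer):
    """Return the cluster label for one customer given raw RFM values."""
    # Same feature order as in training: amount, recency, frequency.
    x = np.log(np.array([[amount, recency, frequency]], dtype=float))
    return int(clusterer.predict(scaler.transform(x))[0])

# Synthetic training data: a low-value lapsed group and a high-value active group.
rng = np.random.default_rng(1)
low = rng.lognormal(mean=[5.0, 5.0, 0.2], sigma=0.2, size=(100, 3))
high = rng.lognormal(mean=[9.0, 1.5, 3.0], sigma=0.2, size=(100, 3))
X_raw = np.vstack([low, high])

# Fit the scaler and the model on log-transformed features, as in the notebook.
demo_scaler = StandardScaler().fit(np.log(X_raw))
demo_clusterer = KMeans(n_clusters=2, n_init=10, random_state=101).fit(
    demo_scaler.transform(np.log(X_raw))
)

big_spender = assign_segment(9000.0, 4, 20, demo_scaler, demo_clusterer)
lapsed = assign_segment(150.0, 200, 1, demo_scaler, demo_clusterer)
print(big_spender, lapsed)
```

With the notebook's own objects the call would presumably be `assign_segment(amount, recency, frequency, scaler, clusterer)`. Because `scaler` was fitted on a DataFrame, scikit-learn may warn about missing feature names when it is given a plain array.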

## note/questions
-------------

#### evaluate model

#### review process

#### determine next step
