# Clustering by Mini Batch K-Means

Here, we apply Mini Batch K-Means to segment a group of customers described by their Recency, Frequency and Monetary Value. See [](../00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb) for how the data is prepared.


References: 
- [K-Means](https://scikit-learn.org/stable/modules/clustering.html#k-means)
- [Mini Batch K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html)

Notebook Sequence:
- [/00-data/00-explore-and-prepare-data.ipynb](../00-data/00-explore-and-prepare-data.ipynb)
- [/00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb](../00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb)
- [This Notebook](../01-clustering/00-clustering-by-mini-batch-k-means.ipynb)
- [/02-interpretation/00-interprete.ipynb](../02-interpretation/00-interprete.ipynb)

# Set up

In [1]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import pandas as pd
from sklearn.manifold import TSNE
import pickle

# Data
## Load Data

The cell below assumes that the dataset is already registered in the AML Workspace.

In [2]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

# Get information about the workspace
workspace = Workspace.from_config()

# Get dataset registered in AML by name
dataset = Dataset.get_by_name(workspace, name='online-retail-frm')

# Convert Dataset to Pandas DataFrame
df_orig = dataset.to_pandas_dataframe()

In [3]:
# Make a copy
df = df_orig.copy()
df

Unnamed: 0,Recency(Days),Frequency,Monetary(£)
0,30,100,2537.91
1,66,2,270.00
2,9,72,1457.55
3,63,30,512.50
4,86,13,459.40
...,...,...,...
3009,12,16,323.36
3010,64,9,173.90
3011,150,10,180.60
3012,0,534,1533.38


## Split Data


In [4]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.01, random_state=9)
df_train.shape
df_train.head()
df_test.shape
df_test.head()

(2983, 3)

Unnamed: 0,Recency(Days),Frequency,Monetary(£)
2113,45,100,1080.62
821,48,17,337.34
1456,9,19,149.47
1002,98,16,632.04
2233,17,21,1436.83


(31, 3)

Unnamed: 0,Recency(Days),Frequency,Monetary(£)
676,117,66,354.8
2836,0,39,366.23
1798,42,30,110.8
1884,30,13,247.0
1596,5,7,1363.2
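
The shapes above follow from how `train_test_split` sizes a fractional split: the test set gets the ceiling of `test_size * n_samples`, and the remaining rows go to training. The sketch below reproduces this on synthetic data. It assumes the DataFrame has 3014 rows (inferred from 2983 + 31), and the column values are random placeholders, not the real data.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumption: 3014 rows, inferred from the train/test shapes shown above.
n_samples = 3014
rng = np.random.default_rng(9)

# Placeholder values only; the column names mirror the real dataset.
df_demo = pd.DataFrame({
    "Recency(Days)": rng.integers(0, 365, size=n_samples),
    "Frequency": rng.integers(1, 500, size=n_samples),
    "Monetary(£)": rng.uniform(10, 3000, size=n_samples).round(2),
})

df_demo_train, df_demo_test = train_test_split(df_demo, test_size=0.01, random_state=9)

# With a float test_size, scikit-learn rounds the test set size up.
expected_test_rows = math.ceil(0.01 * n_samples)
print(df_demo_train.shape, df_demo_test.shape, expected_test_rows)
```

A 1% hold-out leaves only 31 rows for testing, which is enough to smoke-test predictions but not to evaluate cluster quality.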


## Define `sklearn.pipeline`
References:
- [User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline)
- [`sklearn.pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [6]:
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline

# Configure PowerTransformer
ptransformer = PowerTransformer(method="yeo-johnson")
ptransformer

# Configure kmeans
n_clusters = 4
batch_size = int(df_train.shape[0]*0.1)

km = MiniBatchKMeans(n_clusters=n_clusters,
                     random_state=9,
                     batch_size=batch_size,
                     max_iter=100)
km

pipeline = Pipeline(steps=[('ptransformer', ptransformer), ('mini-batch-k-means', km)],
                    verbose=True)
pipeline
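
To make the two pipeline steps concrete, here is a minimal sketch on synthetic data. It assumes right-skewed (log-normal) values as a stand-in for the three customer columns. The names `X_demo`, `demo_pipeline` and `demo_batch_size` are hypothetical, and `n_init=3` is set explicitly only because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Assumption: synthetic, right-skewed data standing in for the three customer columns.
rng = np.random.default_rng(9)
X_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(2983, 3))

# Same batch-size rule as above: 10% of the training rows, truncated to an int.
demo_batch_size = int(X_demo.shape[0] * 0.1)

demo_pipeline = Pipeline(steps=[
    ("ptransformer", PowerTransformer(method="yeo-johnson")),
    ("mini-batch-k-means", MiniBatchKMeans(n_clusters=4,
                                           random_state=9,
                                           batch_size=demo_batch_size,
                                           max_iter=100,
                                           n_init=3)),
])
demo_labels = demo_pipeline.fit_predict(X_demo)

# PowerTransformer standardizes by default, so each column ends up near mean 0, std 1.
X_transformed = demo_pipeline.named_steps["ptransformer"].transform(X_demo)
print(demo_batch_size, X_transformed.mean(axis=0).round(3), X_transformed.std(axis=0).round(3))

# Predicting through the pipeline equals transforming first, then predicting with K-Means.
manual_labels = demo_pipeline.named_steps["mini-batch-k-means"].predict(X_transformed)
print(np.array_equal(demo_pipeline.predict(X_demo), manual_labels))
```

Because `PowerTransformer` standardizes its output by default, K-Means sees columns on a comparable scale instead of being dominated by the largest-valued column.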

# MLflow

Create a new MLflow experiment.

In [7]:
import mlflow

# Create an experiment
experiment_id = mlflow.create_experiment(name='online-retail-customer-segmentation-mlflow', 
                                         tags={'purpose':'tutorial', 'pipeline':'sklearn.pipeline'})

# Get experiment by experiment_id
experiment = mlflow.get_experiment(experiment_id=experiment_id)

# Display
experiment

<Experiment: artifact_location='', experiment_id='59ded27c-77d3-42cb-acb5-9ea5b706c7f1', lifecycle_stage='active', name='online-retail-customer-segmentation-mlflow', tags={'pipeline': 'sklearn.pipeline', 'purpose': 'tutorial'}>

Set the above as the active experiment.

## Infer input and output signature

In [8]:
from mlflow.models import infer_signature

# Example input and output
model_output = np.array([0, 2]) # example output, i.e. cluster label
model_input = df.iloc[0:2]

# Infer signature, i.e. input and output
signature = infer_signature(model_input=model_input, model_output=model_output)
signature


inputs: 
  ['Recency(Days)': long, 'Frequency': long, 'Monetary(£)': double]
outputs: 
  [Tensor('int64', (-1,))]

## Fit the pipeline

In [9]:
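# Fit the pipeline and log it to MLflow together with its signature.
# Note: this fits on the full DataFrame `df`, which includes the rows in `df_test`;
# fit on `df_train` instead if the test rows should stay unseen.
with mlflow.start_run() as run:
    pipeline.fit(df)
    #mlflow.sklearn.autolog()
    mlflow.sklearn.log_model(pipeline, artifact_path="model", signature=signature)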

[Pipeline] ...... (step 1 of 2) Processing ptransformer, total=   0.0s
[Pipeline]  (step 2 of 2) Processing mini-batch-k-means, total=   0.2s




ModelInfo(artifact_path='model', flavors={'python_function': {'model_path': 'model.pkl', 'loader_module': 'mlflow.sklearn', 'python_version': '3.8.13', 'env': 'conda.yaml'}, 'sklearn': {'pickled_model': 'model.pkl', 'sklearn_version': '1.1.1', 'serialization_format': 'cloudpickle', 'code': None}}, model_uri='runs:/abf0f507-bdf5-42bc-8946-338825cae11d/model', model_uuid='e29f9101a73b4a898a2867923d39412e', run_id='abf0f507-bdf5-42bc-8946-338825cae11d', saved_input_example_info=None, signature_dict={'inputs': '[{"name": "Recency(Days)", "type": "long"}, {"name": "Frequency", "type": "long"}, {"name": "Monetary(\\u00a3)", "type": "double"}]', 'outputs': '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1]}}]'}, utc_time_created='2022-06-14 14:29:14.907836', mlflow_version='1.26.1')

## Load the trained model

In [7]:
run_id = run.info.run_id; run_id
pipeline_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
type(pipeline_model)
pipeline_model

'abf0f507-bdf5-42bc-8946-338825cae11d'

sklearn.pipeline.Pipeline


## Use model to predict

In [11]:
# Use trained model to predict using df_test
pipeline_model.predict(df_test)

array([1, 0, 2, 2, 0, 2, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 1, 2, 2, 0, 1, 0,
       2, 0, 2, 2, 2, 2, 1, 1, 3], dtype=int32)
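
To see how the 31 test customers spread across the four clusters, the predicted labels can be tallied. A minimal sketch using only NumPy, with the labels copied from the output above:

```python
import numpy as np

# Cluster labels copied from the prediction output above.
labels = np.array([1, 0, 2, 2, 0, 2, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 1, 2, 2, 0, 1, 0,
                   2, 0, 2, 2, 2, 2, 1, 1, 3])

# Count how many customers fall into each of the 4 clusters.
counts = np.bincount(labels, minlength=4)
for cluster_id, count in enumerate(counts):
    print(f"cluster {cluster_id}: {count} customers")
```

In this sample, cluster 2 holds 15 of the 31 customers and cluster 3 only one.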

## Retrieve `run` information

### Retrieve `run` data and info

In [12]:
run.data

<RunData: metrics={}, params={}, tags={'mlflow.rootRunId': 'abf0f507-bdf5-42bc-8946-338825cae11d',
 'mlflow.runName': 'willing_cart_68fjg5s9',
 'mlflow.source.name': '/anaconda/envs/py38_clustering/lib/python3.8/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'Chew-Yean Yam'}>

In [13]:
run.info

<RunInfo: artifact_uri='azureml://experiments/Default/runs/abf0f507-bdf5-42bc-8946-338825cae11d/artifacts', end_time=None, experiment_id='715de42a-bcc1-435d-a6a8-2ecb17c71da6', lifecycle_stage='active', run_id='abf0f507-bdf5-42bc-8946-338825cae11d', run_uuid='abf0f507-bdf5-42bc-8946-338825cae11d', start_time=1655216954324, status='RUNNING', user_id='91072e23-c428-4c22-aed3-07e2c212bc44'>

### Retrieve `artifacts`

In [14]:
# Create an MLflow tracking client
client = mlflow.tracking.MlflowClient()
client

# List mlflow artifacts
client.list_artifacts(run_id=run.info.run_id)

<mlflow.tracking.client.MlflowClient at 0x7f15732f4e50>

[<FileInfo: file_size=-1, is_dir=True, path='model'>]

## Data Management

### Upload to Datastore

In [9]:
if False:
#if True:
    from azureml.core import Workspace, Dataset

    workspace = Workspace.from_config()
    print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id, sep = '\n')

    datastore = workspace.get_default_datastore()
    datastore

    # Save to local
    filename = '../../.aml/data/online-retail-frm-train.csv'
    df_train.to_csv(filename, index=False)

    filename = '../../.aml/data/online-retail-frm-test.csv'
    df_test.to_csv(filename, index=False)

    # Upload to datastore
    Dataset.File.upload_directory('../../.aml/data', datastore, overwrite=True)

chyam_aml_tutorial_2022_03
chyam_aml_tutorial_2022_03
westeurope
b5ba1607-7cac-4a12-9477-7853892342c8


{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-18855be7-60d0-4ac2-80e7-ddfa5e86cf24",
  "account_name": "chyamamltutori8678483931",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

Validating arguments.
Arguments validated.
Uploading file to /
Uploading an estimated of 6 files
Uploading ../../.aml/data/online-retail-frm-test.csv
Uploaded ../../.aml/data/online-retail-frm-test.csv, 1 files out of an estimated total of 6
Uploading ../../.aml/data/online-retail-frm-train.csv
Uploaded ../../.aml/data/online-retail-frm-train.csv, 2 files out of an estimated total of 6
Uploading ../../.aml/data/online-retail-frm-transformed.csv
Uploaded ../../.aml/data/online-retail-frm-transformed.csv, 3 files out of an estimated total of 6
Uploading ../../.aml/data/online-retail-frm.csv
Uploaded ../../.aml/data/online-retail-frm.csv, 4 files out of an estimated total of 6
Uploading ../../.aml/data/online-retail-processed.csv
Uploaded ../../.aml/data/online-retail-processed.csv, 5 files out of an estimated total of 6
Uploading ../../.aml/data/online-retail.csv
Uploaded ../../.aml/data/online-retail.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Creating new dataset


{
  "source": [
    "('workspaceblobstore', '//')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ]
}

### Register Dataframe as Dataset

In [13]:
if False:
#if True:
    from azureml.core import Workspace, Dataset

    workspace = Workspace.from_config()
    workspace

    datastore = workspace.get_default_datastore()
    datastore

    # Dataset name to register as 
    name = 'online-retail-frm-train'

    # create a new dataset
    Dataset.Tabular.register_pandas_dataframe(dataframe=df_train, 
                                              target=datastore, 
                                              name=name, 
                                              show_progress=True, 
                                              tags={'Purpose':'Tutorial'})

    # Dataset name to register as 
    name = 'online-retail-frm-test'

    # create a new dataset
    Dataset.Tabular.register_pandas_dataframe(dataframe=df_test, 
                                              target=datastore, 
                                              name=name, 
                                              show_progress=True, 
                                              tags={'Purpose':'Tutorial'})

Workspace.create(name='chyam_aml_tutorial_2022_03', subscription_id='b5ba1607-7cac-4a12-9477-7853892342c8', resource_group='chyam_aml_tutorial_2022_03')

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-18855be7-60d0-4ac2-80e7-ddfa5e86cf24",
  "account_name": "chyamamltutori8678483931",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/d67b00e4-80ef-4276-8e44-6c31927d2bf6/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


{
  "source": [
    "('workspaceblobstore', 'managed-dataset/d67b00e4-80ef-4276-8e44-6c31927d2bf6/')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ],
  "registration": {
    "id": "1311309d-15a3-4b08-b346-519b3ecbaff4",
    "name": "online-retail-frm-train",
    "version": 1,
    "tags": {
      "Purpose": "Tutorial"
    },
    "workspace": "Workspace.create(name='chyam_aml_tutorial_2022_03', subscription_id='b5ba1607-7cac-4a12-9477-7853892342c8', resource_group='chyam_aml_tutorial_2022_03')"
  }
}

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/dabc765e-e7e4-4feb-a18f-ae06e787b944/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


{
  "source": [
    "('workspaceblobstore', 'managed-dataset/dabc765e-e7e4-4feb-a18f-ae06e787b944/')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ],
  "registration": {
    "id": "2fdb0af0-e5aa-4e0a-8296-57edd7d5da2c",
    "name": "online-retail-frm-test",
    "version": 1,
    "tags": {
      "Purpose": "Tutorial"
    },
    "workspace": "Workspace.create(name='chyam_aml_tutorial_2022_03', subscription_id='b5ba1607-7cac-4a12-9477-7853892342c8', resource_group='chyam_aml_tutorial_2022_03')"
  }
}
