# Clustering by Mini Batch K-Means

Here, we apply Mini Batch K-Means to segment a group of customers described by Recency, Frequency and Monetary Value. See [](../00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb) for how the data is prepared.


References: 
- [K-Means](https://scikit-learn.org/stable/modules/clustering.html#k-means)
- [Mini Batch K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html)
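
For intuition, the update rule behind Mini Batch K-Means can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper name `mini_batch_kmeans`, the synthetic two-blob data, and the per-center step size of `1 / count` are all choices made for this sketch.

```python
import numpy as np


def mini_batch_kmeans(X, n_clusters, batch_size, n_iter=50, seed=0):
    """Simplified mini-batch k-means (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)  # number of points each center has absorbed

    for _ in range(n_iter):
        # Draw a random mini-batch instead of using the full dataset
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest center (squared Euclidean distance)
        distances = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Move each center towards its points with a step size that shrinks over time
        for x, k in zip(batch, labels):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centers[k] = (1.0 - eta) * centers[k] + eta * x
    return centers


# Synthetic data (assumption): two well-separated blobs around (0, 0) and (100, 100)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(100.0, 1.0, size=(200, 2))])

centers = mini_batch_kmeans(X, n_clusters=2, batch_size=40)
centers = centers[np.argsort(centers[:, 0])]  # order the centers for display
print(centers)
```

Because each iteration touches only `batch_size` rows, the cost per step stays constant as the dataset grows, which is the main reason to prefer the mini-batch variant over full K-Means on larger data.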

Notebooks Sequence:
- [/00-data/00-explore-and-prepare-data.ipynb](../00-data/00-explore-and-prepare-data.ipynb)
- [/00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb](../00-data/01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb)
- [This Notebook](../01-clustering/00-clustering-by-mini-batch-k-means.ipynb)
- [/02-interpretation/00-interprete.ipynb](../02-interpretation/00-interprete.ipynb)

# Set up

In [None]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import pandas as pd
from sklearn.manifold import TSNE
import pickle

# Data
## Load Data

The cell below assumes that the dataset is registered in the Azure Machine Learning (AML) workspace.

In [None]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

# Get information about the workspace
workspace = Workspace.from_config()

# Get dataset registered in AML by name
dataset = Dataset.get_by_name(workspace, name='online-retail-frm')

# Convert Dataset to Pandas DataFrame
df_orig = dataset.to_pandas_dataframe()

In [None]:
# Make a copy
df = df_orig.copy()
df

### Define `sklearn.pipeline`
References:
- [User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline)
- [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [None]:
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline

# Configure PowerTransformer
ptransformer = PowerTransformer(method="yeo-johnson")
ptransformer

# Configure Mini Batch K-Means: 4 clusters, each mini-batch holding 10% of the rows
n_clusters = 4
batch_size = int(df.shape[0]*0.1)

km = MiniBatchKMeans(n_clusters=n_clusters,
                     random_state=9,
                     batch_size=batch_size,
                     max_iter=100)
km

pipeline = Pipeline(steps=[('ptransformer', ptransformer), ('mini-batch-k-means', km)],
                    verbose=True)
pipeline
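
To make the first pipeline step less of a black box, here is a minimal NumPy sketch of the Yeo-Johnson transform for a fixed `lmbda`. The helper `yeo_johnson` and the sample values are hypothetical. `PowerTransformer` itself estimates one lambda per feature by maximum likelihood and, by default, also standardizes the result, neither of which this sketch does.

```python
import numpy as np


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of x for a fixed lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values: a Box-Cox style transform applied to x + 1
    if lmbda != 0:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values: mirrored transform with exponent 2 - lambda
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


# Hypothetical, right-skewed monetary values
monetary = np.array([5.0, 20.0, 150.0, 1647.0, 25000.0])
print(yeo_johnson(monetary, lmbda=0.0))  # lambda = 0 compresses the long right tail
print(yeo_johnson(monetary, lmbda=1.0))  # lambda = 1 leaves the values unchanged
```

Reducing skew this way matters for K-Means style clustering because the algorithm relies on Euclidean distances, which a few very large values would otherwise dominate.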

# MLflow

Create a new MLflow experiment.

In [None]:
import mlflow

# Create an experiment (this raises an error if an experiment with this name already exists)
experiment_id = mlflow.create_experiment(name='online-retail-customer-segmentation-mlflow',
                                         tags={'purpose':'tutorial', 'pipeline':'sklearn.pipeline'})

# Get experiment by experiment_id
experiment = mlflow.get_experiment(experiment_id=experiment_id)

# Display
experiment

Set the experiment created above as the active experiment.

## Infer input and output signature

In [None]:
from mlflow.models import infer_signature

# Example input and output
model_output = np.array([0, 2]) # example output, i.e. cluster label
model_input = df.iloc[0:2]

# Infer signature, i.e. input and output
signature = infer_signature(model_input=model_input, model_output=model_output)
signature

## Fit the pipeline

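Fit the pipeline inside an MLflow run and log the fitted model explicitly with `mlflow.sklearn.log_model()`. The `mlflow.sklearn.autolog()` call is left commented out; to use it instead, enable it before calling `fit()`.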

In [None]:
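#mlflow.end_run()  # Uncomment to close a run left open by an earlier, interrupted cell

# Fit the pipeline and log it within an MLflow run
with mlflow.start_run() as run:
    #mlflow.sklearn.autolog()  # If enabled, this must be called before fit()
    pipeline.fit(df)
    # Log the fitted pipeline as the run's "model" artifact.
    # You will get "Outputs + logs" /model/, /pipeline/. The content of their 'MLModel' files is slightly different.
    mlflow.sklearn.log_model(pipeline, artifact_path="model", signature=signature)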

## Load the trained model

In [None]:
run_id = run.info.run_id; run_id
pipeline_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
type(pipeline_model)
pipeline_model

## Use model to predict

In [None]:
# Sample test data
test_data = [[12, 109, 1647],  # cluster 2
             [85, 33, 553],    # cluster 3
             [84, 6, 146],     # cluster 1
             [12, 22, 348]]    # cluster 0

# Use trained model to predict
pipeline_model.predict(test_data)

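The call above warns about missing feature names, because the pipeline was fitted on a DataFrame but is given a plain list here. The cell below resolves this by passing a DataFrame with the same column names.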

In [None]:
# Sample test data
data = [[12, 109, 1647],  # cluster 2
        [85, 33, 553],    # cluster 3
        [84, 6, 146],     # cluster 1
        [12, 22, 348]]    # cluster 0

test_data = pd.DataFrame(data, columns=['Recency(Days)', 'Frequency', 'Monetary(£)'])

# Use trained model to predict
pipeline_model.predict(test_data)

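## Retrieve `run` information

### Retrieve `run` data and metadata

In [None]:
# `run` is a snapshot taken when the run started; re-fetch it to see what was logged
mlflow.get_run(run.info.run_id).data

In [None]:
mlflow.get_run(run.info.run_id).info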

### Retrieve `artifacts`

In [None]:
# Create an MLflow tracking client
client = mlflow.tracking.MlflowClient()
client

# List the artifacts logged by the run
client.list_artifacts(run_id=run.info.run_id)
