# Introduction
This is an example of a Python notebook that trains a model (here, a k-means clustering model) and then exports it as PMML so that it can be easily integrated with additional decision logic (DMN, https://drools.org/learn/dmn.html).

### Use case: credit card dispute risk
In this example, an end user has filed a credit card dispute and we want to predict the risk related to the disputed transaction.

### Prerequisites
To run this notebook you need:
- Python 3 https://www.python.org/downloads/
- pip https://pypi.org/project/pip/
- Jupyter https://jupyter.org/ (`pip install jupyterlab`)

Install dependencies:
- pandas
- scikit-learn
- numpy
- nyoka
- matplotlib

All of them can be installed by cloning this repo and running the command `pip install -r ./binder/requirements.txt`.

Finally, start the environment with the command `jupyter notebook`.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from nyoka import skl_to_pmml

# Step 1 - Load and prepare data
Data is usually available as files in text (CSV) or binary (Parquet, Avro) format, and there are many libraries to load them from local or remote storage.

After the loading step it is quite common to perform some preparation and cleanup actions: check domain boundaries, handle missing values, normalize strings, convert enumerations to numbers, etc.

In [16]:
df = pd.read_csv('input_data.csv')
# splitting buyed_items string to arrays
df.buyed_items = df.buyed_items.str[1:-1].str.replace("'",'').str.replace(",",'').str.split(' ')
# splitting buyed_customer_items_id string to arrays
df.buyed_customer_items_id = df.buyed_customer_items_id.str[1:-1].str.replace("'",'').str.replace(",",'').str.split(' ')
df.head()

Unnamed: 0,type,buyed_items,type_index,buyed_customer_items_id
0,Book,"[Book-0, Book-1, Book-3, Book-4, Book-6, Book-8]",0,"[1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ..."
1,Book,"[Book-3, Book-4, Book-5, Book-7, Book-8, Book-9]",0,"[0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, ..."
2,Book,"[Book-1, Book-3, Book-4, Book-7, Book-8, Book-9]",0,"[0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, ..."
3,Book,"[Book-2, Book-3, Book-4, Book-5, Book-6, Book-7]",0,"[0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ..."
4,Book,"[Book-0, Book-2, Book-3, Book-5, Book-6, Book-8]",0,"[1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ..."
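
The string slicing above works when every cell looks like a printed Python list. As an alternative sketch, assuming each cell is a valid Python list literal (for example `"['Book-0', 'Book-1']"` or `"[1, 0, 1]"`), `ast.literal_eval` can parse it directly. The small `sample` frame below is made up for illustration and stands in for `input_data.csv`.

```python
import ast

import pandas as pd

# Hypothetical stand-in for input_data.csv: list-valued cells stored as strings
sample = pd.DataFrame({
    "buyed_items": ["['Book-0', 'Book-1']", "['Car-2', 'Car-5']"],
    "buyed_customer_items_id": ["[1, 1, 0, 0]", "[0, 0, 1, 1]"],
})

# literal_eval only evaluates Python literals; it does not run arbitrary code
for column in ["buyed_items", "buyed_customer_items_id"]:
    sample[column] = sample[column].apply(ast.literal_eval)

print(sample.buyed_items[0])
print(sample.buyed_customer_items_id[1])
```

In this sample the numeric lists come back as integers rather than strings, which avoids a later type conversion.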


# Step 2 - Prepare training set and test set
When a model is trained, it is important to use different sets of data to train it and to test it, so that the test measures how well the model generalizes to data it has not seen.

In [8]:
#raw_inputs_grouped = pd.DataFrame(df['buyed_items'])
#print(raw_inputs_grouped)
#raw_inputs = pd.DataFrame(raw_inputs_grouped.buyed_items.tolist(), index = raw_inputs_grouped.index)
#print(raw_inputs)

#d1 = [['Book-0', 'Car-0', 'PC-0'], ['Book-1', 'Car-1', 'PC-1'],['Book-2', 'Car-2', 'PC-2'],['Book-3', 'Car-3', 'PC-3'],['Book-4', 'Car-4', 'PC-4'],['Book-5', 'Car-5', 'PC-5'],['Book-6', 'Car-6', 'PC-6'],['Book-7', 'Car-7', 'PC-7'],['Book-8', 'Car-8', 'PC-8'],['Book-9', 'Car-9', 'PC-9']]
#d2 = [['Book-0', 'Book-1', 'Book-2', 'Book-3', 'Book-4', 'Book-5', 'Book-6', 'Book-7', 'Book-8', 'Book-9'], ['Car-0', 'Car-1', 'Car-2', 'Car-3', 'Car-4', 'Car-5', 'Car-6', 'Car-7', 'Car-8', 'Car-9'],['PC-0', 'PC-1', 'PC-2', 'PC-3', 'PC-4', 'PC-5', 'PC-6', 'PC-7', 'PC-8', 'PC-9']]


#books = ['Book-0', 'Book-1', 'Book-2', 'Book-3', 'Book-4', 'Book-5', 'Book-6', 'Book-7', 'Book-8', 'Book-9']
#cars = ['Car-0', 'Car-1', 'Car-2', 'Car-3', 'Car-4', 'Car-5', 'Car-6', 'Car-7', 'Car-8', 'Car-9']
#pcs = ['PC-0', 'PC-1', 'PC-2', 'PC-3', 'PC-4', 'PC-5', 'PC-6', 'PC-7', 'PC-8', 'PC-9']
#enc = preprocessing.OneHotEncoder(categories=[books, cars, pcs])

#enc.fit(d2)
#inputs = enc.transform([['Book-0', 'Book-1', 'Book-4']]).toarray()
#print(inputs)


inputs = df[['type_index', 'buyed_customer_items_id']]
#print(inputs)

#d1 = {'teams': [[0, 1, 0],[0, 1, 0],[0, 1, 0], [0, 1, 0],[0, 1, 0]]}
#df2 = pd.DataFrame(d1)
#print (df2)

# expand the buyed_customer_items_id lists into one column per item
df2 = pd.DataFrame(inputs.buyed_customer_items_id.tolist(), index=inputs.index)

print(df2)



    0  1  2  3  4  5  6  7  8  9   ... 20 21 22 23 24 25 26 27 28 29
0    1  1  0  1  1  0  1  0  1  0  ...  0  0  0  0  0  0  0  0  0  0
1    0  0  0  1  1  1  0  1  1  1  ...  0  0  0  0  0  0  0  0  0  0
2    0  1  0  1  1  0  0  1  1  1  ...  0  0  0  0  0  0  0  0  0  0
3    0  0  1  1  1  1  1  1  0  0  ...  0  0  0  0  0  0  0  0  0  0
4    1  0  1  1  0  1  1  0  1  0  ...  0  0  0  0  0  0  0  0  0  0
..  .. .. .. .. .. .. .. .. .. ..  ... .. .. .. .. .. .. .. .. .. ..
995  0  0  0  0  0  0  0  0  0  0  ...  1  1  1  0  1  0  0  1  1  0
996  0  0  0  0  0  0  0  0  0  0  ...  0  1  1  0  1  1  1  0  1  0
997  0  0  0  0  0  0  0  0  0  0  ...  1  0  1  0  1  1  1  0  1  0
998  0  0  0  0  0  0  0  0  0  0  ...  1  1  1  1  0  0  1  1  0  0
999  0  0  0  0  0  0  0  0  0  0  ...  1  1  1  1  0  1  0  0  0  1

[1000 rows x 30 columns]
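
The cell above prepares a single feature matrix and does not hold any rows back. A minimal sketch of a hold-out split with scikit-learn's `train_test_split` follows; the random matrix `features`, the 80/20 ratio and the seed are assumptions standing in for `df2`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df2: 1000 rows of 30 binary item flags
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.integers(0, 2, size=(1000, 30)))

# Hold back 20% of the rows; a fixed random_state makes the split repeatable
train_set, test_set = train_test_split(features, test_size=0.2, random_state=0)

print(train_set.shape, test_set.shape)
```

The model would then be fitted on `train_set` only and evaluated on `test_set`.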



# Step 3 - Train the model
There are many different models that can be used. In this example we use k-means clustering (scikit-learn's `KMeans`) with three clusters, wrapped in a `Pipeline`.

In [9]:


pipeline = Pipeline([
    ("classifier", KMeans(n_clusters=3, random_state=0))
])
trained_model = pipeline.fit(df2)
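
To see what the clustering step does, here is a small sketch on made-up data: three groups of customers, each buying six items from its own block of ten, mirroring the Book/Car/PC layout of the output above. The helper `make_group` and the group sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

def make_group(block, rows=50, items_per_block=10, blocks=3, bought=6):
    """Hypothetical helper: customers who buy `bought` items from one block only."""
    data = np.zeros((rows, items_per_block * blocks), dtype=int)
    for row in data:  # each row is a view, so the assignment updates `data`
        picked = rng.choice(items_per_block, size=bought, replace=False)
        row[block * items_per_block + picked] = 1
    return data

X = np.vstack([make_group(b) for b in range(3)])

# n_init is set explicitly because its default differs between scikit-learn versions
demo_pipeline = Pipeline([
    ("classifier", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = demo_pipeline.fit_predict(X)

# Count how the 50 rows of each group are spread over the three clusters
for b in range(3):
    print("group", b, "->", np.bincount(labels[b * 50:(b + 1) * 50], minlength=3))
```

Cluster ids are arbitrary, so group 0 does not have to land in cluster 0; what matters is that rows of the same group share a cluster.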

# Step 4 - Test the model

There are multiple ways to test the model. First of all, you should test it using test data.

In [10]:
model_score = trained_model.score(df2)
print("model_score: " + str(model_score))

model_score: -2395.770362128503
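
The negative value is expected: for `KMeans`, `score` returns the negative of the k-means objective, that is, minus the sum of squared distances from each sample to its closest centroid. A minimal sketch on random data (the matrix `X` is made up) that checks this against the fitted `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)  # hypothetical binary features

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared distance to the closest centroid
distances = model.transform(X)  # shape (n_samples, n_clusters)
objective = np.sum(distances.min(axis=1) ** 2)

print("score:   ", model.score(X))
print("-inertia:", -model.inertia_)
print("by hand: ", -objective)
```

A score closer to zero means tighter clusters, but the value depends on the number of rows and columns, so it is only comparable between models fitted on the same data.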


Note: Pay attention to the overfitting problem (https://en.wikipedia.org/wiki/Overfitting) while you train and test your model. For an accuracy-style score, for example, a value of 0.99 or similar is often a sign of a probable overfit.

Additionally, you can print the predictions next to the input data to compare them visually.

In [32]:
dfsimple = pd.read_csv('simple_input_data.csv')
# splitting buyed_items string to arrays
dfsimple.buyed_items = dfsimple.buyed_items.str[1:-1].str.replace("'",'').str.replace(",",'').str.split(' ')
# splitting buyed_customer_items_id string to arrays
dfsimple.buyed_customer_items_id = dfsimple.buyed_customer_items_id.str[1:-1].str.replace("'",'').str.replace(",",'').str.split(' ')
dfsimple.head()

test_inputs = dfsimple['buyed_customer_items_id']
dftest  = pd.DataFrame(test_inputs.tolist(), index=test_inputs.index)


predictions = trained_model.predict(dftest)
results = pd.DataFrame({'cluster prediction': predictions.astype(int), 'truth': dfsimple['buyed_customer_items_id']})
print(results)


    cluster prediction                                              truth
0                    2  [1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...
1                    2  [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...
2                    2  [0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...
3                    2  [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
4                    2  [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...
5                    2  [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
6                    2  [0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
7                    2  [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
8                    2  [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...
9                    2  [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...
10                   2  [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, ...
11                   2  [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ...
12                   2  [1, 0, 1, 0, 0

Another approach is to plot predicted data and real data on the same chart

In [36]:
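# NOTE: input_test is never defined in this notebook, so this cell raises the NameError shown below
predictions = trained_model.predict(input_test)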
plt.rcParams["figure.figsize"] = (20,6)

type_index = 2
testing = pd.DataFrame({'items_id': np.linspace(0, input_test['items_id'].max(), 100), 'type_index': type_index})
sub_test = df[df['type_index']==type_index]

DT_predictions = trained_model.predict(testing).astype(int)
plt.subplot(121)
plt.plot(testing['items_id'], DT_predictions, color='red', label='Predicted')
plt.scatter(sub_test['items_id'], sub_test['buyer_group'], color='black', label='truth')
plt.legend()
plt.show()

NameError: name 'input_test' is not defined
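
The plotting cell above cannot run as written. As a hedged alternative for a clustering model, the sketch below plots cluster predictions against a known group for made-up data; `true_group`, the synthetic matrix `X` and the output file name are assumptions, not part of the original data set.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up data: 3 groups of 40 customers, each buying 6 of the 10 items in its block
true_group = np.repeat(np.arange(3), 40)
X = np.zeros((120, 30), dtype=int)
for row, group in zip(X, true_group):
    row[group * 10 + rng.choice(10, size=6, replace=False)] = 1

predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Plot known group and predicted cluster per row; the small offset keeps both visible
fig, ax = plt.subplots(figsize=(10, 4))
rows = np.arange(len(X))
ax.scatter(rows, true_group, color="black", label="truth (group)")
ax.scatter(rows, predicted + 0.1, color="red", marker="x", label="predicted (cluster)")
ax.set_xlabel("row index")
ax.set_ylabel("group / cluster id")
ax.legend()
fig.savefig("cluster_vs_truth.png")
```

Because cluster ids are arbitrary, the red and black series do not need to overlap; a good result is one where each group maps to a single cluster.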

Scoring and visualization are just simple ways to test the trained model. They are usually not enough for real-world use cases.

There are more advanced ways to analyze how the model is performing (e.g. ROC), and there are many other aspects to consider: fairness, explainability, interpretability.

Additional resources:
- https://en.wikipedia.org/wiki/Receiver_operating_characteristic
- https://christophm.github.io/interpretable-ml-book/
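
ROC analysis applies to models that output a score for a binary label, which is not the case for the clustering model in this notebook. As a minimal sketch of the idea, with four made-up labels and scores, scikit-learn's `roc_curve` and `roc_auc_score` can be used as follows:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up ground truth and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("AUC:", auc)
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one; here three of the four positive/negative pairs are ranked correctly, which gives 0.75.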

# Step 5 - Save the model as PMML
When you are happy with your model, you can export it as PMML.

In [8]:
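# Assumption: col_names should list the 30 feature columns the model was trained on (df2)
feature_names = [str(c) for c in df2.columns]
skl_to_pmml(trained_model, feature_names, 'buyer_group',
            "cluster_buyer_predictor.pmml")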