# Introduction

## Using this notebook

This notebook helps you analyze datasets and find interesting and meaningful patterns in the data. If you only want an automated report outlining the most important features of your dataset, set the *dataset* variable to the path of your data file and run the notebook. Afterwards, you can export the report as HTML and read it in a web browser.

If you are interested in a more interactive analysis of your data, you can also adjust the parameters of the notebook. Each section contains several values that can be adapted to your needs; they are described in the code comments.

Finally, if you want to go beyond an automated report and answer your own questions, you can look at the final section of the notebook and use the code examples there to generate your own figures and analysis from the data model.

### Reading this report in a web browser

This report uses several statistical methods and specific phrases and concepts from the domains of statistics and machine learning. Whenever such methods are used, a small "Explanation" sign at the side of the report marks a short explanation of the methods and phrases. Clicking it will reveal the explanation.

You can toggle the global visibility of these explanations with a button at the top left corner of the report. The code can also be toggled with a button.

All graphs are interactive and display additional content on hover. You can get the exact values of the functions by selecting the associated areas in the graph. You can also move the plots around and zoom into interesting parts.

### Acknowledgments

This notebook is built on the MSPN implementation by Molina et al. and was developed during the course of a bachelor thesis under the supervision of Alejandro Molina and Kristian Kersting at TU Darmstadt. The goal of this framework is to use sum product networks for hybrid domains to highlight important aspects and interesting features of a given dataset.

In [1]:
import pickle
import pandas as pd
import numpy as np

#from tfspn.SPN import SPN
from pprint import PrettyPrinter
from IPython.display import Image
from IPython.display import display, Markdown
from importlib import reload

import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

from src.util.text_util import printmd, strip_dataset_name
import src.ba_functions as f
import src.dn_plot as p
import src.dn_text_generation as descr
import src.util.data_util as util
from src.util.spn_util import get_categoricals

from src.util.CSVUtil import learn_piecewise_from_file, load_from_csv

init_notebook_mode(connected=True)
# pp = PrettyPrinter()

In [18]:
# path to the dataset you want to use for training
dataset = 'example_data/top20medical.csv'

# the minimum number of datapoints that are included in a child of a 
# sum node
min_instances = 60

# the parameter which governs how strict the independence test will be
# 1 results in all features being evaluated as independent, 0 will 
# result in no features being accepted as truly independent
independence_threshold = 0.3


spn, dictionary = learn_piecewise_from_file(
    data_file=dataset, 
    header=0, 
    min_instances=min_instances, 
    independence_threshold=independence_threshold, 
    feature_file='example_data/top20medical.features')
df = pd.read_csv(dataset)
context = dictionary['context']
context.dataset = strip_dataset_name(dataset)
categoricals = get_categoricals(spn, context)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


X scores are null at iteration 0


invalid value encountered in true_divide


Y residual constant at iteration 0
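
The first message in the output above is pandas' `SettingWithCopyWarning`, which older pandas versions emit when a value is assigned to what may be a temporary copy of a DataFrame slice. Where the learning code triggers it is not visible here, so the sketch below only illustrates the two usual ways to avoid the ambiguity. The frame and its column names `age` and `dose` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; not part of the top20medical dataset.
df = pd.DataFrame({"age": [40, 65, 72], "dose": [1.0, 2.0, 3.0]})

# Option 1: take an explicit copy of the subset before modifying it.
# Changes to `older` are then clearly independent of `df`.
older = df[df["age"] > 60].copy()
older.loc[:, "dose"] = 0.0

# Option 2: modify the original frame with a single .loc call
# instead of chained indexing such as df[df["age"] > 60]["dose"] = ...
df.loc[df["age"] > 60, "dose"] = 9.0

print(older)
print(df)
```

With the explicit copy, the subset and the original frame end up with different values, which makes the intent of each assignment unambiguous.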

In [19]:
# path to the model pickle file
model_path = "deep_notebooks/models/test.pickle"

# UNCOMMENT THE FOLLOWING LINES TO LOAD A MODEL
#spn = pickle.load(open('../myokardinfarkt/spn_save.txt', 'rb'))
#df, _, dictionary = load_from_csv('../myokardinfarkt/data/cleaned_pandas.csv', header = 0)
#context = pickle.load(open('../myokardinfarkt/context_save.txt', 'rb'))
#context.feature_names = ([entry['name']
#                                  for entry in dictionary['features']])
#dictionary['context'] = context
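
As a minimal sketch of the save-and-load round trip that the commented lines rely on, the following example pickles a stand-in object and restores it using context managers, so that file handles are closed reliably. The helper names `save_model` and `load_model` and the toy dictionary are assumptions for illustration; they are not part of the notebook's modules. Note that `pickle.load` should only be used on files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(obj, path):
    """Serialize any picklable object to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_model(path):
    """Restore an object previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a learned model and its context.
toy_model = {"weights": [0.6, 0.4], "feature_names": ["a", "b"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pickle"
    save_model(toy_model, path)
    restored = load_model(path)

print(restored)
```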

In [20]:
reload(descr)
descr.introduction(context)

# Exploring the top20medical dataset

<figure align="right" style="padding: 1em; float:right; width: 300px">
	<img alt="the logo of TU Darmstadt"
		src="images/tu_logo.gif">
	<figcaption><i>Report framework created @ TU Darmstadt</i></figcaption>
        </figure>
This report describes the dataset top20medical and contains general statistical
information and an analysis on the influence different features and subgroups
of the data have on each other. The first part of the report contains general
statistical information about the dataset and an analysis of the variables
and probability distributions.<br/>
The second part focuses on a subgroup analysis of the data. Different
clusters identified by the network are analyzed and compared to give an
insight into the structure of the data. Finally, the influence that different
variables have on the predictive capabilities of the model is analyzed.<br/>
The whole report is generated by fitting a sum product network to the
data and extracting all information from this model.

## General statistical evaluation

In [21]:
reload(descr)
descr.data_description(context, df)



The dataset contains 505 entries and comprises 21 features, which are "fi_tre002", "fi_tre00125", "fi_tre00124", "fi_tre00123", "fi_tre00122", "pe_ecg0041", "qu_cad002211", "pe_ecg004", "fi_adm006", "lb_mmy0061", "lb_mmy0072", "lb_mmy0012", "qu_crf021", "pe_ecg002", "qu_mpa001", "pe_ecg006", "pe_ecg003", "lb_mmy0062", "qu_crf022", "qu_mpa0011", "di_dia002".

"fi_tre00125", "lb_mmy0061", "lb_mmy0072", "lb_mmy0012", "pe_ecg003", "lb_mmy0062" are continuous features, while "fi_tre002", "fi_tre00124", "fi_tre00123", "fi_tre00122", "pe_ecg0041", "qu_cad002211", "pe_ecg004", "fi_adm006", "qu_crf021", "pe_ecg002", "qu_mpa001", "pe_ecg006", "qu_crf022", "qu_mpa0011", "di_dia002" are categorical features. Continuous and discrete features were approximated with piecewise linear density functions, while categorical features are represented by histograms of their probability.

Below, the means and standard deviations of each feature are shown. Categorical 
features do not have a mean and a standard deviation, since they contain no ordering. Instead, 
the network returns NaN.

In [22]:
descr.means_table(spn, context)
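
To illustrate how a mean and a standard deviation can be obtained from a piecewise linear density, the following sketch integrates a toy density numerically. It is not the notebook's implementation: the function name `piecewise_linear_moments` and the triangular example density are assumptions made for illustration.

```python
import numpy as np


def piecewise_linear_moments(xs, ys, grid_size=20001):
    """Mean and standard deviation of a density given by breakpoints (xs, ys).

    The density is linearly interpolated between the breakpoints and
    integrated with the trapezoidal rule on a dense grid.
    """
    grid = np.linspace(xs[0], xs[-1], grid_size)
    dens = np.interp(grid, xs, ys)

    def integrate(values):
        # Trapezoidal rule written out explicitly.
        return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)

    dens = dens / integrate(dens)  # normalize so the density integrates to 1
    mean = integrate(grid * dens)
    variance = integrate((grid - mean) ** 2 * dens)
    return mean, float(np.sqrt(variance))


# Triangular density on [0, 2] with its peak at 1.
mean, std = piecewise_linear_moments([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(mean, std)
```

For this symmetric triangle the mean is 1 and the variance is 1/6, which the numerical result reproduces closely. A categorical feature has no such ordering of its values, which is why no meaningful mean can be computed for it.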

In the following section, the marginal distribution of each feature is shown. This 
is the distribution of each feature without knowing anything about the other values.

In [23]:
reload(descr)
descr.features_shown = 'all'

descr.show_feature_marginals(spn, dictionary)
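
As a small illustration of what a marginal distribution is, the sketch below sums a joint probability table over the feature that is not of interest. The two binary features and all probabilities are invented for the example and are unrelated to the dataset.

```python
import numpy as np

# Hypothetical joint distribution of two binary features A (rows) and B (columns).
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Marginalizing means summing over all values of the other feature.
marginal_a = joint.sum(axis=1)  # distribution of A, ignoring B
marginal_b = joint.sum(axis=0)  # distribution of B, ignoring A

print(marginal_a, marginal_b)
```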

### Correlations

To get a sense of how the features relate to one another, the correlation between 
them is analyzed in the next section. The correlation denotes how strongly two features are 
linked. A high correlation (close to 1 or -1) means that two features are very closely related, 
while a correlation close to 0 means that there is no linear interdependency between the features.

The correlation is reported in a colored matrix, where blue denotes a negative and red denotes 
a positive correlation.
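
The following sketch shows how a correlation matrix of this kind behaves on synthetic data, using NumPy's `np.corrcoef` rather than the notebook's own routine. It also shows one common reason for `nan` entries in such a matrix: a feature with zero variance. Whether that is the cause of any `nan` values in this notebook's output is not established here. All variables are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # strongly and positively related to x
z = rng.normal(size=500)                       # generated independently of x
const = np.full(500, 3.0)                      # zero variance

# A constant feature leads to a 0/0 division; silence the resulting warning.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(np.vstack([x, y, z, const]))

print(np.round(corr, 2))
```

The entry for `x` and `y` is close to 1, the entry for `x` and `z` is close to 0, and the whole row and column of the constant feature are `nan`.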

In [24]:
descr.correlation_threshold = 0.4

corr = descr.correlation_description(spn, dictionary)


invalid value encountered in greater


invalid value encountered in less



[[1.00000000e+00            nan 1.39131691e-01 4.70523843e-02
  4.05605664e-03 1.70284751e-02 1.94726021e-02 1.34068321e-02
  1.96377818e-02            nan            nan            nan
  3.51602820e-03 1.70904420e-02 3.23936145e-03 7.32941659e-04
             nan            nan 2.85178581e-03 2.57693355e-03
  8.90690377e-02]
 [           nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan            nan            nan            nan
             nan]
 [1.39131691e-01            nan 1.00000000e+00 7.61440969e-02
  2.30588916e-02 3.62609536e-02 4.34240695e-02 1.60349631e-02
  2.59985336e-02            nan            nan            nan
  3.38983917e-03 2.97458597e-02 3.23519608e-03 1.85779168e-03
             nan            nan 3.15066830e-03 2.61623712e-03
  1.58390849e-01]
 [4.70523843e-02

No features show more than a very weak correlation.

The conditional distributions are the probabilities of the features, given 
a certain instance of a class. The joint probability functions of correlated variables 
are shown below to allow a more in-depth look into the dependency.
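
As a minimal sketch of a conditional distribution, the example below fixes the value of one feature in a joint probability table and renormalizes. The table, the feature roles, and the helper name `conditional` are hypothetical.

```python
import numpy as np

# Hypothetical joint table: rows = treatment (0/1), columns = diagnosis (0/1).
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])


def conditional(joint_table, given_column):
    """P(row feature | column feature == given_column)."""
    column = joint_table[:, given_column]
    return column / column.sum()


p_treatment_given_diag1 = conditional(joint, 1)
print(p_treatment_given_diag1)
```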

In [25]:
reload(descr)

descr.correlation_threshold = 0.2
descr.feature_combinations = 'all'
descr.show_conditional = True

descr.categorical_correlations(spn, dictionary)


"fi_tre00125" and "fi_tre002" have a weak  relation.

The features "fi_tre00124" and "fi_tre00125" have a moderate dependency.

The model shows a weak relation for "pe_ecg003" and "fi_tre00124".

The features "lb_mmy0062" and "fi_tre002" have a weak relation.

The features "lb_mmy0062" and "fi_tre00124" have a weak interdependency.

The model shows a weak relation for the features "lb_mmy0062" and "pe_ecg0041".

"lb_mmy0062" and "fi_adm006" have a weak relation.

The features "lb_mmy0062" and "lb_mmy0061" have a moderate interdependency.

The features "lb_mmy0062" and "lb_mmy0012" have a weak relation.

There is a weak relationship for the features "di_dia002" and "fi_tre00125".

"di_dia002" and "lb_mmy0061" have a weak interdependency.

"di_dia002" and "lb_mmy0062" have a moderate relationship.

---

## Cluster evaluation

To give an impression of the data representation as a whole, the complete network graph is 
shown below. The model is a tree, with a sum node at its center. The root of the tree is shown 
in white, while the sum and product nodes are green and blue respectively. Finally, all 
leaves are represented by red nodes.

In [26]:
#p.plot_graph(spn=spn, fname='deep_notebooks/images/graph.png', context=context)
#display(Image(filename='deep_notebooks/images/graph.png', width=400))

The data model provides a clustering of the data points into groups in which features are 
independent. The groups extracted from the data are outlined below together with a short 
description of the data they cover. Each branch in the model represents one cluster found 
in the data model.
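
The idea that each branch is a cluster in which features are independent can be sketched with a very small, hand-built network: a root sum node mixes two product nodes, and each product node multiplies independent per-feature leaves. This is a toy under stated assumptions, not the learned model; the feature names `smoker` and `diabetic` and all numbers are invented.

```python
import itertools

# Each cluster is a product node over independent categorical leaves.
clusters = [
    {"weight": 0.6, "leaves": {"smoker": [0.9, 0.1], "diabetic": [0.8, 0.2]}},
    {"weight": 0.4, "leaves": {"smoker": [0.3, 0.7], "diabetic": [0.4, 0.6]}},
]


def product_node(leaves, sample):
    """Multiply the leaf probabilities of all features present in `sample`."""
    prob = 1.0
    for feature, value in sample.items():
        prob *= leaves[feature][value]
    return prob


def sum_node(sample):
    """Weighted mixture of the clusters; omitted features are marginalized out."""
    return sum(c["weight"] * product_node(c["leaves"], sample) for c in clusters)


total = sum(
    sum_node({"smoker": s, "diabetic": d})
    for s, d in itertools.product([0, 1], repeat=2)
)
p_both = sum_node({"smoker": 1, "diabetic": 1})
p_smoker = sum_node({"smoker": 1})
p_diabetic = sum_node({"diabetic": 1})

print(total, p_both, p_smoker * p_diabetic)
```

Within each cluster the two features are independent, yet the mixture as a whole is not: the joint probability of both features being 1 differs from the product of the two marginals.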

### Description of all clusters

In [27]:
# possible values: 'all', 'big', int (leading to a random sample), list of nodes to be displayed
nodes = f.get_sorted_nodes(spn)

reload(descr)
descr.nodes = 'all'
descr.show_node_graphs = False

descr.node_introduction(spn, nodes, context)

The SPN contains 6 clusters.


These are:

- a deep Product Node, representing 58.42% of the data.
  - The node has 2 children and 374 descendants, resulting in a remaining depth of 7.

- a deep Product Node, representing 11.09% of the data.
  - The node has 21 children and 63 descendants, resulting in a remaining depth of 3.

- a deep Product Node, representing 9.9% of the data.
  - The node has 21 children and 63 descendants, resulting in a remaining depth of 3.

- a deep Product Node, representing 9.11% of the data.
  - The node has 21 children and 63 descendants, resulting in a remaining depth of 3.

- a deep Product Node, representing 8.71% of the data.
  - The node has 21 children and 63 descendants, resulting in a remaining depth of 3.

- a deep Product Node, representing 2.77% of the data.
  - The node has 21 children and 63 descendants, resulting in a remaining depth of 3.

The node representatives are the most likely data points for each node. They are archetypal for what the node represents and what subgroup of the data it encapsulates.

As stated above, each cluster captures a subgroup of the data. To show what variables are 
captured by which cluster, the means and variances for each feature and subgroup are plotted below. 
This highlights where the node has its focus.

In [28]:
descr.features_shown = 'all'
descr.mean_threshold = 0.1
descr.variance_threshold = 0.1
descr.separation_threshold = 0.1

separations = descr.show_node_separation(spn, nodes, context)

The feature "fi_tre00125" is weakly separated by the clustering. The variances of the nodes 0, 4, 5 are significantly larger than for the average node. The means of the nodes 0, 5 are significantly larger than for the average node. The means of the nodes 1, 2, 3 are significantly smaller than for the average node.

The feature "lb_mmy0061" is weakly separated by the clustering. The variance of node 4 is significantly larger than for the average node. The means of the nodes 3, 4 are significantly larger than for the average node. The means of the nodes 0, 1 are significantly smaller than for the average node.

The feature "lb_mmy0072" is weakly separated by the clustering. The variances of the nodes 0, 1, 4 are significantly larger than for the average node. The means of the nodes 3, 4 are significantly larger than for the average node. The mean of node 0 is significantly smaller than for the average node.

The feature "lb_mmy0012" is weakly separated by the clustering. The variance of node 4 is significantly larger than for the average node. The mean of node 4 is significantly larger than for the average node. The mean of node 0 is significantly smaller than for the average node.

The feature "pe_ecg003" is weakly separated by the clustering. The variance of node 0 is significantly larger than for the average node. The mean of node 0 is significantly larger than for the average node. The means of the nodes 1, 2, 4, 5 are significantly smaller than for the average node.

The feature "lb_mmy0062" is moderately separated by the clustering. The variance of node 4 is significantly larger than for the average node. The means of the nodes 3, 4 are significantly larger than for the average node. The means of the nodes 0, 1 are significantly smaller than for the average node.

An analysis of the distribution of categorical variables is given below. If a cluster or 
a group of clusters captures a large fraction of the total likelihood of a categorical 
instance, it can be interpreted to represent this instance and the associated distribution.
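
One plausible reading of "fraction of the total likelihood captured by a cluster" is the cluster's weighted probability of the categorical value, divided by the sum over all clusters. The sketch below computes this for three invented clusters. Whether the notebook computes the reported percentages in exactly this way is an assumption, and all numbers are made up.

```python
import numpy as np

# Hypothetical mixture weights of three clusters and, for each cluster,
# the probability it assigns to one categorical value.
weights = np.array([0.5, 0.3, 0.2])
p_value_given_cluster = np.array([0.1, 0.8, 0.5])

# Weighted contribution of each cluster to the likelihood of the value.
contributions = weights * p_value_given_cluster
fractions = contributions / contributions.sum()

print(np.round(100 * fractions, 2))
```

Here the second cluster alone accounts for more than half of the likelihood of the value, so under this reading it would be the cluster that "represents" it.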

In [29]:
reload(descr)

descr.categoricals = 'all'

descr.node_categorical_description(spn, dictionary)

#### Distribution of fi_tre002

88.91% of "[0.]" is captured by the nodes 0, 3. The probability of "[0.]" for this group of nodes is 43.19%.

The feature "fi_tre002" is not separated well along the primary clusters.

#### Distribution of fi_tre00124

The feature "fi_tre00124" is not separated well along the primary clusters.

#### Distribution of fi_tre00123

The feature "fi_tre00123" is not separated well along the primary clusters.

#### Distribution of fi_tre00122

The feature "fi_tre00122" is not separated well along the primary clusters.

#### Distribution of pe_ecg0041


invalid value encountered in double_scalars



The feature "pe_ecg0041" is not separated well along the primary clusters.

#### Distribution of qu_cad002211

98.0% of "[0.]" is captured by the nodes 0, 1, 2, 4, 5. The probability of "[0.]" for this group of nodes is 53.47%.

91.18% of "[1.]" is captured by the nodes 0, 3. The probability of "[1.]" for this group of nodes is 60.61%.

#### Distribution of pe_ecg004

75.39% of "[0.]" is captured by the nodes 0, 3. The probability of "[0.]" for this group of nodes is 61.47%.

The feature "pe_ecg004" is not separated well along the primary clusters.

#### Distribution of fi_adm006

83.68% of "[0.]" is captured by the nodes 0, 2, 3. The probability of "[0.]" for this group of nodes is 59.77%.

The feature "fi_adm006" is not separated well along the primary clusters.

#### Distribution of qu_crf021

100.0% of "[0]" is captured by the nodes 0, 1, 2, 3, 4, 5. The probability of "[0]" for this group of nodes is 50.0%.

87.5% of "[1]" is captured by the nodes 0, 2, 4. The probability of "[1]" for this group of nodes is 54.8%.

#### Distribution of pe_ecg002

The feature "pe_ecg002" is not separated well along the primary clusters.

#### Distribution of qu_mpa001

100.0% of "[0.]" is captured by the nodes 0, 1, 2, 3, 4, 5. The probability of "[0.]" for this group of nodes is 50.0%.

86.55% of "[1.]" is captured by the nodes 0, 1, 4. The probability of "[1.]" for this group of nodes is 54.09%.

#### Distribution of pe_ecg006

The feature "pe_ecg006" is not separated well along the primary clusters.

#### Distribution of qu_crf022

The feature "qu_crf022" is not separated well along the primary clusters.

#### Distribution of qu_mpa0011

100.0% of "[0.]" is captured by the nodes 0, 1, 2, 3, 4, 5. The probability of "[0.]" for this group of nodes is 50.0%.

85.99% of "[1.]" is captured by the nodes 0, 1, 4. The probability of "[1.]" for this group of nodes is 53.62%.

#### Distribution of di_dia002

71.52% of "[0]" is captured by the nodes 1, 2, 4, 5. The probability of "[0]" for this group of nodes is 62.8%.

The feature "di_dia002" is not separated well along the primary clusters.

### Correlations by cluster

Finally, since each node captures different interactions between the features, it is 
interesting to look at the correlations again, this time for the separate nodes. Shallow 
nodes are omitted, because the correlation of independent variables is always 0.

In [30]:
reload(descr)

descr.correlation_threshold = 0.1
descr.nodes = 'all'

descr.node_correlation(spn, dictionary)

### Correlations for node 0

[[ 1.00000000e+00             nan  1.12823805e-01  2.35051169e-02
   3.64855933e-03  1.27464111e-02  9.06941443e-03  1.54010072e-03
   1.56428327e-03             nan             nan             nan
   0.00000000e+00  9.81410734e-03  0.00000000e+00  1.76619784e-03
              nan             nan  0.00000000e+00  0.00000000e+00
   5.35019755e-02]
 [            nan             nan             nan             nan
              nan             nan             nan             nan
              nan             nan             nan             nan
              nan             nan             nan             nan
              nan             nan             nan             nan
              nan]
 [ 1.12823805e-01             nan  1.00000000e+00  4.88582053e-02
   2.93524262e-02  3.76797128e-02  3.25107560e-02  1.27872052e-04
   3.39701777e-03             nan             nan             nan
   0.00000000e+00  2.17824154e-02  0.00000000e+00  4.46315184e-03
              nan             nan  0.0

"fi_tre00125" and "fi_tre002" influence each other weakly. As one increases, the other increases. The model shows a weak positive linear relation for the features "fi_tre00124" and "fi_tre002". The model shows a weak positive dependency for "fi_tre00124" and "fi_tre00125". There is a weak positive linear relationship for "fi_tre00123" and "fi_tre00125". "fi_tre00122" and "fi_tre00123" have a weak positive interdependency. "pe_ecg0041" and "fi_tre00125" influence each other weakly. As one increases, the other increases. There is a weak positive relation for the features "qu_mpa001" and "qu_crf021". "pe_ecg003" and "fi_tre002" influence each other weakly, and "pe_ecg003" and "fi_tre00125" influence each other weakly. The model shows a weak positive dependency between "qu_crf022" and "qu_crf021". "qu_mpa0011" and "qu_crf021" influence each other weakly, but "qu_mpa0011" and "fi_tre00125" influence each other moderately. "di_dia002" and "fi_tre00125" influence each other weakly, and "di_dia002" and "fi_tre00125" influence each other weakly.

No other features have more than a very weak correlation.

### Correlations for node 1

[[ 1. nan  0.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [ 0. nan  1.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  1.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  1.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  1.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  1.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  1.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  0.  1. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan

No features show more than a very weak correlation.

### Correlations for node 2

[[ 1. nan  0.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [ 0. nan  1.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  1.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  1.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  1.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  1.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  1.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  0.  1. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan

No features show more than a very weak correlation.

### Correlations for node 3

[[ 1. nan  0.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [ 0. nan  1.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  1.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  1.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  1.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  1.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  1.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  0.  1. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan

No features show more than a very weak correlation.

### Correlations for node 4

[[ 1. nan  0.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [ 0. nan  1.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  1.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  1.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  1.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  1.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  1.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  0.  1. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan

No features show more than a very weak correlation.

### Correlations for node 5

[[ 1. nan  0.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [ 0. nan  1.  0.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  1.  0.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  1.  0.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  1.  0.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  1.  0.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  1.  0. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [ 0. nan  0.  0.  0.  0.  0.  0.  1. nan nan nan  0.  0.  0.  0. nan nan
   0.  0.  0.]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
  nan nan nan]
 [nan nan nan nan nan

No features show more than a very weak correlation.

---
## Predictive data analysis

In [31]:
reload(util)
numerical_data, categorical_data = util.get_categorical_data(spn, df, dictionary)

After the cluster description, the data model is used to predict data points. To evaluate 
the performance of the model, the misclassification rate is shown below.

The classified data points are used to analyze more advanced patterns within the data, by looking
first at the misclassified points, and then at the classification results in total.
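
A common way to classify with a probabilistic model of this kind is to fix the observed features, evaluate the model once for every candidate value of the target feature, and pick the value with the highest probability. The sketch below does this on a hand-built two-cluster mixture. It is an illustration under stated assumptions, not the code behind `descr.classification`; the feature names and numbers are invented.

```python
# Hypothetical toy model: a sum node over two product nodes.
clusters = [
    {"weight": 0.6, "leaves": {"smoker": [0.9, 0.1], "diagnosis": [0.8, 0.2]}},
    {"weight": 0.4, "leaves": {"smoker": [0.3, 0.7], "diagnosis": [0.4, 0.6]}},
]


def joint_probability(sample):
    """Probability of a (possibly partial) assignment under the toy model."""
    total = 0.0
    for cluster in clusters:
        prob = cluster["weight"]
        for feature, value in sample.items():
            prob *= cluster["leaves"][feature][value]
        total += prob
    return total


def predict(evidence, target, classes=(0, 1)):
    """Return the most probable class of `target` given `evidence`, plus all scores."""
    scores = {c: joint_probability({**evidence, target: c}) for c in classes}
    return max(scores, key=scores.get), scores


label, scores = predict({"smoker": 1}, target="diagnosis")
print(label, scores)
```

A data point counts as misclassified when the predicted label differs from the value recorded in the dataset.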

In [32]:
descr.classify = 'all'

misclassified, data_dict = descr.classification(spn, numerical_data, dictionary)

For feature "fi_tre002" the SPN misclassifies 190 instances, resulting in a precision of 62.38%.

For feature "fi_tre00124" the SPN misclassifies 303 instances, resulting in a precision of 40.0%.

For feature "fi_tre00123" the SPN misclassifies 326 instances, resulting in a precision of 35.45%.

For feature "fi_tre00122" the SPN misclassifies 387 instances, resulting in a precision of 23.37%.

For feature "pe_ecg0041" the SPN misclassifies 432 instances, resulting in a precision of 14.46%.

For feature "qu_cad002211" the SPN misclassifies 34 instances, resulting in a precision of 93.27%.

For feature "pe_ecg004" the SPN misclassifies 189 instances, resulting in a precision of 62.57%.

For feature "fi_adm006" the SPN misclassifies 154 instances, resulting in a precision of 69.5%.

For feature "qu_crf021" the SPN misclassifies 117 instances, resulting in a precision of 76.83%.

For feature "pe_ecg002" the SPN misclassifies 105 instances, resulting in a precision of 79.21%.

For feature "qu_mpa001" the SPN misclassifies 85 instances, resulting in a precision of 83.17%.

For feature "pe_ecg006" the SPN misclassifies 48 instances, resulting in a precision of 90.5%.

For feature "qu_crf022" the SPN misclassifies 89 instances, resulting in a precision of 82.38%.

For feature "qu_mpa0011" the SPN misclassifies 82 instances, resulting in a precision of 83.76%.

For feature "di_dia002" the SPN misclassifies 199 instances, resulting in a precision of 60.59%.
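
The percentages reported above are consistent with the share of correctly classified instances among the 505 entries of the dataset, a quantity that is usually called accuracy (precision in the narrow sense relates true positives to all predicted positives). The sketch below recomputes three of the reported values from the misclassification counts; the helper name `correct_share` is an assumption.

```python
n_instances = 505  # number of entries reported in the data description

# Misclassification counts copied from the report above.
misclassified_counts = {"fi_tre002": 190, "qu_cad002211": 34, "pe_ecg0041": 432}


def correct_share(errors, total):
    """Percentage of correctly classified instances, rounded to two decimals."""
    return round(100.0 * (total - errors) / total, 2)


rates = {
    name: correct_share(errors, n_instances)
    for name, errors in misclassified_counts.items()
}
print(rates)
```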

Below, the misclassified examples are explained using the clusters they are most associated with.
For each instance, the clusters that form 90% of the prediction are reported together with the
representatives of these clusters.
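
Selecting the clusters that together form 90% of a prediction can be sketched as follows: normalize the per-cluster contributions, sort them in descending order, and keep clusters until the cumulative share reaches the coverage level. The function name `top_clusters` and the contribution values are hypothetical, and the notebook's own selection rule may differ in detail.

```python
import numpy as np


def top_clusters(contributions, coverage=0.9):
    """Indices of the largest clusters whose shares sum to at least `coverage`."""
    contributions = np.asarray(contributions, dtype=float)
    shares = contributions / contributions.sum()
    order = np.argsort(shares)[::-1]           # largest share first
    cumulative = np.cumsum(shares[order])
    # Position of the first cumulative value that reaches the coverage level.
    count = int(np.searchsorted(cumulative, coverage)) + 1
    return order[:count].tolist()


# Hypothetical contributions of five clusters to one prediction.
selected = top_clusters([0.02, 0.50, 0.05, 0.35, 0.08])
print(selected)
```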

In [33]:
# IMPORTANT: Only set use_shapley to true if you have a really powerful machine
reload(descr)
reload(p)

descr.use_shapley = False
descr.shapley_sample_size = 1
descr.misclassified_explanations = 1

descr.describe_misclassified(spn, dictionary, misclassified, data_dict, numerical_data)

Instance 80 was predicted as "[0.]", even though it is "[1.]", because it was most similar to the following clusters: 0, 1, 2, 4, 5

ValueError: cannot convert float NaN to integer

### Information gain through features

The following graphs highlight the relative importance of different features for a 
classification. They can show how different classes are predicted. For continuous and
discrete features, a high positive or negative importance shows that changing this feature's
value in the positive or negative direction increases the prediction's certainty.

For categorical values, positive and negative values highlight whether changing or keeping
this categorical value increases or decreases the predictive certainty.
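
For continuous features, an importance of this kind can be understood as the gradient of the predicted class probability with respect to each feature: the sign gives the direction in which a change makes the prediction more certain, and the magnitude gives the strength of the effect. The sketch below approximates such a gradient with central finite differences. The logistic `class_probability` is a hypothetical stand-in for a model, and nothing here reflects the internals of `descr.explanation_vector_description`.

```python
import numpy as np


def class_probability(x):
    # Hypothetical stand-in for P(class = 1 | x): reacts positively to x[0],
    # negatively to x[1], and ignores x[2].
    return 1.0 / (1.0 + np.exp(-(2.0 * x[0] - 1.0 * x[1])))


def explanation_vector(prob_fn, x, eps=1e-5):
    """Central finite-difference gradient of prob_fn at the point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (prob_fn(x + step) - prob_fn(x - step)) / (2.0 * eps)
    return grad


vector = explanation_vector(class_probability, [0.2, 0.1, 5.0])
print(np.round(vector, 4))
```

The first component is positive, the second is negative, and the third is zero, matching how the toy function depends on its inputs.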

In [35]:
reload(descr)
reload(p)

descr.explanation_vector_threshold = 0.2
descr.explanation_vector_classes = [20]
descr.explanation_vectors_show = 'all'

expl_vectors = descr.explanation_vector_description(spn, dictionary, data_dict, categoricals, use_shap=True)

#### Class "di_dia002": "[0]"

##### Predictive categorical feature "fi_tre002"



##### Predictive continuous feature "fi_tre00125": "5.0"



##### Predictive categorical feature "fi_tre00124"



##### Predictive categorical feature "fi_tre00123"



##### Predictive categorical feature "fi_tre00122"



##### Predictive categorical feature "pe_ecg0041"



##### Predictive categorical feature "qu_cad002211"



##### Predictive categorical feature "pe_ecg004"



##### Predictive categorical feature "fi_adm006"



##### Predictive continuous feature "lb_mmy0061": "10000.0"



##### Predictive continuous feature "lb_mmy0072": "229906.4"



##### Predictive continuous feature "lb_mmy0012": "8997.0"



##### Predictive categorical feature "qu_crf021"



##### Predictive categorical feature "pe_ecg002"



##### Predictive categorical feature "qu_mpa001"



##### Predictive categorical feature "pe_ecg006"



##### Predictive continuous feature "pe_ecg003": "197.0"



##### Predictive continuous feature "lb_mmy0062": "10000.0"



##### Predictive categorical feature "qu_crf022"



##### Predictive categorical feature "qu_mpa0011"



#### Class "di_dia002": "[1]"

##### Predictive categorical feature "fi_tre002"



##### Predictive continuous feature "fi_tre00125": "5.0"



##### Predictive categorical feature "fi_tre00124"



##### Predictive categorical feature "fi_tre00123"



##### Predictive categorical feature "fi_tre00122"



##### Predictive categorical feature "pe_ecg0041"



##### Predictive categorical feature "qu_cad002211"



##### Predictive categorical feature "pe_ecg004"



##### Predictive categorical feature "fi_adm006"



##### Predictive continuous feature "lb_mmy0061": "10000.0"



##### Predictive continuous feature "lb_mmy0072": "229906.4"



##### Predictive continuous feature "lb_mmy0012": "8997.0"



##### Predictive categorical feature "qu_crf021"



##### Predictive categorical feature "pe_ecg002"



##### Predictive categorical feature "qu_mpa001"



##### Predictive categorical feature "pe_ecg006"



##### Predictive continuous feature "pe_ecg003": "197.0"



##### Predictive continuous feature "lb_mmy0062": "10000.0"



##### Predictive categorical feature "qu_crf022"



##### Predictive categorical feature "qu_mpa0011"



#### Class "di_dia002": "[2]"

##### Predictive categorical feature "fi_tre002"



##### Predictive continuous feature "fi_tre00125": "5.0"



##### Predictive categorical feature "fi_tre00124"



##### Predictive categorical feature "fi_tre00123"



##### Predictive categorical feature "fi_tre00122"



##### Predictive categorical feature "pe_ecg0041"



##### Predictive categorical feature "qu_cad002211"



##### Predictive categorical feature "pe_ecg004"



##### Predictive categorical feature "fi_adm006"



##### Predictive continuous feature "lb_mmy0061": "10000.0"



##### Predictive continuous feature "lb_mmy0072": "229906.4"



##### Predictive continuous feature "lb_mmy0012": "8997.0"



##### Predictive categorical feature "qu_crf021"



##### Predictive categorical feature "pe_ecg002"



##### Predictive categorical feature "qu_mpa001"



##### Predictive categorical feature "pe_ecg006"



##### Predictive continuous feature "pe_ecg003": "197.0"



##### Predictive continuous feature "lb_mmy0062": "10000.0"



##### Predictive categorical feature "qu_crf022"



##### Predictive categorical feature "qu_mpa0011"



### SHAP explanation

The SHAP values for classification show how much influence each feature had on the classification of the datapoint. 
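
To make the idea concrete, the sketch below computes exact Shapley values for a tiny, hypothetical value function by enumerating all feature subsets. The names `shapley_values` and `toy_value` are assumptions for illustration and are not part of this notebook or of the `shap` library; Kernel SHAP approximates the same quantity by sampling subsets, because full enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    features = range(n_features)
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # weight of a coalition of this size in the Shapley average
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # marginal contribution of feature i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical toy "model": output when only the features in `subset` are known.
def toy_value(subset):
    score = 0.0
    if 0 in subset:
        score += 0.3
    if 1 in subset:
        score += 0.1
    if 0 in subset and 1 in subset:
        score += 0.2  # interaction between features 0 and 1
    return score

phi = shapley_values(toy_value, 3)
print(phi)
```

The values sum to the difference between the output with all features known and the output with none known, and the unused third feature receives a value of zero.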

In [None]:
#from spn.algorithms.Inference import log_likelihood
#import numpy as np
#
#def create_predictor(spn, index, values):
#    
#    def predict_proba(x):
#        # work on a copy so the caller's array is not modified
#        x = np.array(x, dtype=float)
#        all_probs = []
#        # marginalize the target feature to get the evidence
#        x[:, index] = np.nan
#        normalization = log_likelihood(spn, x)
#    
#        for v in values:
#            x[:, index] = v
#            all_probs.append(log_likelihood(spn, x) - normalization)
#        # one row per instance, one column per value
#        return np.exp(np.hstack(all_probs))
#    
#    return predict_proba


In [None]:
#import shap
#
## print the JS visualization code to the notebook
#shap.initjs()
#
## use Kernel SHAP to explain test set predictions
#explainer = shap.KernelExplainer(create_predictor(spn, 0, [0,1]), numerical_data)
#print('Done 1')
#shap_values = explainer.shap_values(numerical_data, nsamples=2)
#print('Done 2')



In [None]:
## plot the SHAP values for the first output (class 0) of instance 20
#shap.force_plot(explainer.expected_value[0], shap_values[0][20,:], numerical_data[20,:])

---

## Conclusion

In [None]:
reload(descr)
descr.print_conclusion(spn, dictionary, corr, nodes, separations, expl_vectors)

## Dive into the data

Use the Facets Interface to visualize data on your own. You can either load the dataset itself, or show the data as predicted by the model.

In [None]:
# Load the dataset and convert it to JSON for the visualization
import pandas as pd
df = pd.read_csv(dataset)
jsonstr = df.to_json(orient='records')

# Display the Dive visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

## Build your own queries

This notebook lets you add your own analyses to the report above. For example, you may want to drill down into specific subclusters of the data, or predict additional data points that are not part of the training data.

In [None]:
from spn.algorithms.Inference import likelihood

# select the sample to evaluate (slicing keeps the 2-D shape)
data_point = numerical_data[1:2]
# get the probability from the model's joint probability function
proba = likelihood(spn, data_point)


printmd(data_point)
printmd(proba)

You can also predict the probability of several data points at the same time.

In [None]:
data_point = numerical_data[0:3]
proba = likelihood(spn, data_point)

printmd(data_point)
printmd(proba)
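
The plotting cells below obtain a conditional probability by dividing a joint likelihood by the likelihood of the evidence, with every feature that should be ignored set to `NaN` so that it is marginalized out. The sketch below illustrates the same ratio without an SPN: it assumes a small, made-up joint probability table, and `toy_likelihood` is a hypothetical stand-in for the model's likelihood function.

```python
import numpy as np

# Hypothetical joint distribution P(feature, label):
# rows are feature values {0, 1}, columns are label values {0, 1, 2}.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
])

def toy_likelihood(feature, label):
    """Joint probability; passing np.nan for an argument marginalizes it out."""
    rows = slice(None) if np.isnan(feature) else int(feature)
    cols = slice(None) if np.isnan(label) else int(label)
    return float(np.sum(joint[rows, cols]))

feature_value = 1
# P(feature = 1), with the label marginalized out
evidence = toy_likelihood(feature_value, np.nan)
# P(label | feature = 1) = P(feature = 1, label) / P(feature = 1)
conditional = [toy_likelihood(feature_value, label) / evidence for label in (0, 1, 2)]
print(conditional)
```

Because the evidence is the sum of the joint probabilities over all labels, the conditional probabilities for a fixed feature value sum to one.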

In [None]:
from src import dn_plot
import numpy as np
from src.util.spn_util import func_from_spn
from spn.algorithms import Inference

idx1 = df.columns.get_loc('pe_ecg004')
idx2 = df.columns.get_loc('di_dia002')

detail = 100

x_range = np.linspace(context.domains[idx1][0], context.domains[idx1][1], detail)
values = [0,1,2]

all_res = []

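for i in values:
    # NaN entries are marginalized out, so only the feature at idx1 is fixed
    full_data = np.full((detail, df.values.shape[1]), np.nan)
    full_data[:, idx1] = x_range
    evidence = Inference.likelihood(spn, full_data)
    # additionally fix the class value to obtain the joint likelihood
    full_data[:, idx2] = i
    joint = Inference.likelihood(spn, full_data)
    # P(class | feature) = P(class, feature) / P(feature)
    all_res.append((joint/evidence).squeeze())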

data = [Scatter(
        x=x_range,
        y=all_res[i],
        mode='lines',
    ) for i, _ in enumerate(values)]

layout = dict(width=450,
              height=450,
              xaxis=dict(title=context.feature_names[idx1]),
              yaxis=dict(title='Conditional probability')
             )


iplot({'data': data, 'layout': layout})

In [None]:
from src import dn_plot
import numpy as np
from src.util.spn_util import func_from_spn
from spn.algorithms import Inference

idx1 = df.columns.get_loc('fi_tre002')
idx2 = df.columns.get_loc('di_dia002')

detail = 100

x_range = np.linspace(context.domains[idx1][0], context.domains[idx1][1], detail)
values = [0,1,2]

all_res = []

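for i in values:
    # NaN entries are marginalized out, so only the feature at idx1 is fixed
    full_data = np.full((detail, df.values.shape[1]), np.nan)
    full_data[:, idx1] = x_range
    evidence = Inference.likelihood(spn, full_data)
    # additionally fix the class value to obtain the joint likelihood
    full_data[:, idx2] = i
    joint = Inference.likelihood(spn, full_data)
    # P(class | feature) = P(class, feature) / P(feature)
    all_res.append((joint/evidence).squeeze())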

data = [Scatter(
        x=x_range,
        y=all_res[i],
        mode='lines',
    ) for i, _ in enumerate(values)]

layout = dict(width=450,
              height=450,
              xaxis=dict(title=context.feature_names[idx1]),
              yaxis=dict(title='Conditional probability')
             )


iplot({'data': data, 'layout': layout})

In [None]:
from src import dn_plot
import numpy as np
from src.util.spn_util import func_from_spn
from spn.algorithms import Inference

idx1 = df.columns.get_loc('qu_mpa001')
idx2 = df.columns.get_loc('di_dia002')

detail = 100

x_range = np.linspace(context.domains[idx1][0], context.domains[idx1][1], detail)
values = [0,1,2]

all_res = []

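for i in values:
    # NaN entries are marginalized out, so only the feature at idx1 is fixed
    full_data = np.full((detail, df.values.shape[1]), np.nan)
    full_data[:, idx1] = x_range
    evidence = Inference.likelihood(spn, full_data)
    # additionally fix the class value to obtain the joint likelihood
    full_data[:, idx2] = i
    joint = Inference.likelihood(spn, full_data)
    # P(class | feature) = P(class, feature) / P(feature)
    all_res.append((joint/evidence).squeeze())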

data = [Scatter(
        x=x_range,
        y=all_res[i],
        mode='lines',
    ) for i, _ in enumerate(values)]

layout = dict(width=450,
              height=450,
              xaxis=dict(title=context.feature_names[idx1]),
              yaxis=dict(title='Conditional probability')
             )


iplot({'data': data, 'layout': layout})