# Introduction

## Using this notebook

This notebook helps you analyze datasets and find interesting and meaningful patterns in the data. If you are only interested in an automated report outlining the most important features of your dataset, you can set the path to your data file in the *dataset* variable and run the notebook. Afterwards, you can export the report as HTML and read it in a web browser.

If you are interested in a more interactive analysis of your data, you can also adapt the parameters of the notebook. Each section contains several values that can be adjusted to suit your needs. These values are described in the code comments.

Finally, if you want to go beyond an automated report and answer your own questions, you can look at the final section of the notebook and use the code examples there to generate your own figures and analysis from the data model.

### Reading this report in a web browser

This report uses several statistical methods and specific phrases and concepts from the domains of statistics and machine learning. Whenever such methods are used, a small "Explanation" sign at the side of the report marks a short explanation of the methods and phrases. Clicking it will reveal the explanation.

You can toggle the global visibility of these explanations with a button at the top left corner of the report. The code can also be toggled with a button.

All graphs are interactive and display additional content on hover. You can get the exact values of the functions by selecting the associated areas in the graph. You can also move the plots around and zoom into interesting parts.

### Acknowledgments

This notebook is built on the MSPN implementation by Molina et al. and was created during the course of a bachelor thesis under the supervision of Alejandro Molina and Kristian Kersting at TU Darmstadt. The goal of this framework is to use sum product networks for hybrid domains to highlight important aspects and interesting features of a given dataset.

In [11]:
import pickle
import pandas as pd
import numpy as np

#from tfspn.SPN import SPN
from pprint import PrettyPrinter
from IPython.display import Image
from IPython.display import display, Markdown
from importlib import reload

import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

from src.util.text_util import printmd, strip_dataset_name
import src.ba_functions as f
import src.dn_plot as p
import src.dn_text_generation as descr
import src.util.data_util as util
from src.util.spn_util import get_categoricals

from src.util.CSVUtil import learn_piecewise_from_file, load_from_csv

init_notebook_mode(connected=True)
# pp = PrettyPrinter()

In [12]:
# path to the dataset you want to use for training
dataset = 'example_data/titanic.csv'

# the minimum number of datapoints that are included in a child of a 
# sum node
min_instances = 50

# the parameter which governs how strict the independence test will be
# 1 results in all features being evaluated as independent, 0 will
# result in no features being accepted as truly independent
independence_threshold = 0.3


spn, dictionary = learn_piecewise_from_file(
    data_file=dataset, 
    header=0, 
    min_instances=min_instances, 
    independence_threshold=independence_threshold, )
df = pd.read_csv(dataset)
context = dictionary['context']
context.dataset = strip_dataset_name(dataset)
categoricals = get_categoricals(spn, context)

In [15]:
# path to the model pickle file
model_path = "deep_notebooks/models/test.pickle"

# load a previously trained model and its context instead of training a new one
with open('../myokardinfarkt/spn_save.txt', 'rb') as model_file:
    spn = pickle.load(model_file)
df, _, dictionary = load_from_csv('../myokardinfarkt/data/cleaned_pandas.csv', header=0)
with open('../myokardinfarkt/context_save.txt', 'rb') as context_file:
    context = pickle.load(context_file)
context.feature_names = [entry['name'] for entry in dictionary['features']]
# store the context under the same string key that the training cell reads
dictionary['context'] = context

In [16]:
from spn.structure.Base import get_topological_order
get_topological_order(spn)

[HistogramNode_36,
 PiecewiseLinearNode_37,
 HistogramNode_38,
 PiecewiseLinearNode_39,
 HistogramNode_40,
 PiecewiseLinearNode_41,
 HistogramNode_42,
 PiecewiseLinearNode_43,
 HistogramNode_44,
 PiecewiseLinearNode_45,
 HistogramNode_46,
 PiecewiseLinearNode_47,
 HistogramNode_48,
 PiecewiseLinearNode_49,
 HistogramNode_50,
 PiecewiseLinearNode_51,
 HistogramNode_52,
 PiecewiseLinearNode_53,
 HistogramNode_54,
 PiecewiseLinearNode_55,
 HistogramNode_60,
 PiecewiseLinearNode_61,
 HistogramNode_65,
 PiecewiseLinearNode_66,
 HistogramNode_67,
 PiecewiseLinearNode_68,
 HistogramNode_69,
 PiecewiseLinearNode_70,
 HistogramNode_71,
 PiecewiseLinearNode_72,
 HistogramNode_73,
 PiecewiseLinearNode_74,
 HistogramNode_75,
 PiecewiseLinearNode_76,
 HistogramNode_77,
 PiecewiseLinearNode_78,
 HistogramNode_79,
 PiecewiseLinearNode_80,
 HistogramNode_81,
 PiecewiseLinearNode_82,
 HistogramNode_83,
 PiecewiseLinearNode_84,
 HistogramNode_85,
 PiecewiseLinearNode_86,
 HistogramNode_89,
 PiecewiseLin

In [18]:
context.dataset = 'myokardinfarkt'
reload(descr)
descr.introduction(context)

# Exploring the myokardinfarkt dataset

<figure align="right" style="padding: 1em; float:right; width: 300px">
	<img alt="the logo of TU Darmstadt"
		src="images/tu_logo.gif">
	<figcaption><i>Report framework created @ TU Darmstadt</i></figcaption>
        </figure>
This report describes the dataset myokardinfarkt and contains general statistical
information and an analysis on the influence different features and subgroups
of the data have on each other. The first part of the report contains general
statistical information about the dataset and an analysis of the variables
and probability distributions.<br/>
The second part focusses on a subgroup analysis of the data. Different
clusters identified by the network are analyzed and compared to give an
insight into the structure of the data. Finally, the influence that different
variables have on the predictive capabilities of the model is analyzed.<br/>
The whole report is generated by fitting a sum product network to the
data and extracting all information from this model.

## General statistical evaluation

In [19]:
reload(descr)
descr.data_description(context, df)

TypeError: object of type 'NoneType' has no len()

Below, the means and standard deviations of each feature are shown. Categorical 
features do not have a mean and a standard deviation, since they contain no ordering. Instead, 
the network returns NaN.

In [20]:
descr.means_table(spn, context)

TypeError: object of type 'NoneType' has no len()
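
To illustrate the statement about categorical features, the following is a minimal sketch in plain Python, not the code behind `descr.means_table`. The helper `mean_and_std` and the toy columns are hypothetical, and the sketch assumes that categorical values are stored as strings, which may differ from how the notebook encodes them.

```python
import math
from statistics import mean, pstdev


def mean_and_std(values):
    """Return (mean, std) for numeric columns and (NaN, NaN) otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        return mean(values), pstdev(values)
    # Categorical values have no ordering, so no mean or spread is defined.
    return math.nan, math.nan


# Hypothetical toy columns, not taken from the dataset used above.
toy_table = {
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "S"],
}

summary = {name: mean_and_std(column) for name, column in toy_table.items()}
for name, (mu, sigma) in summary.items():
    print(f"{name}: mean={mu:.2f}, std={sigma:.2f}")
```

The numeric column yields an ordinary mean and standard deviation, while the categorical column yields NaN for both.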

In the following section, the marginal distribution of each feature is shown. This 
is the distribution of each feature without knowing anything about the other values.

In [9]:
reload(descr)
descr.features_shown = 'all'

descr.show_feature_marginals(spn, dictionary)

IndexError: list index out of range
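
As a rough illustration of what a marginal distribution is, the sketch below sums a small joint distribution over all features except one. The joint table, the feature names `A` and `B`, and the helper `marginal` are made up for this example and do not reflect how the SPN computes marginals internally.

```python
from collections import defaultdict

# Hypothetical joint distribution P(A, B) over two discrete features.
joint = {
    ("a1", "b1"): 0.15, ("a1", "b2"): 0.10,
    ("a2", "b1"): 0.10, ("a2", "b2"): 0.10,
    ("a3", "b1"): 0.15, ("a3", "b2"): 0.40,
}


def marginal(joint_probs, axis):
    """Sum the joint probabilities over every feature except the one at `axis`."""
    result = defaultdict(float)
    for assignment, prob in joint_probs.items():
        result[assignment[axis]] += prob
    return dict(result)


marginal_a = marginal(joint, axis=0)
marginal_b = marginal(joint, axis=1)
print(marginal_a)
print(marginal_b)
```

Each marginal still sums to one, because it only redistributes the probability mass of the joint table.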

### Correlations

To get a sense of how the features relate to one another, the correlation between 
them is analyzed in the next section. The correlation denotes how strongly two features are 
linked. A high correlation (close to 1 or -1) means that two features are very closely related, 
while a correlation close to 0 means that there is no linear interdependency between the features.

The correlation is reported in a colored matrix, where blue denotes a negative and red denotes 
a positive correlation.

In [8]:
descr.correlation_threshold = 0.4

corr = descr.correlation_description(spn, dictionary)

8



invalid value encountered in greater


invalid value encountered in less



No features show more than a very weak correlation.
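
The idea of reporting only correlations above a threshold can be sketched as follows. This is a simplified stand-in using the Pearson coefficient on toy columns. How `descr.correlation_description` computes and filters correlations is not shown in this notebook, so the comparison against the threshold is an assumption.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm


# Hypothetical toy columns chosen to show the three typical cases.
toy = {
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],     # rises together with x
    "down": [9, 7, 6, 3, 1],    # falls as x rises
    "flat": [1, -1, 0, -1, 1],  # no linear relation to the others
}
threshold = 0.4  # assumed to play the role of descr.correlation_threshold

reported = {}
names = list(toy)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(toy[first], toy[second])
        if abs(r) >= threshold:
            reported[(first, second)] = round(r, 2)

print(reported)
```

Pairs involving `flat` drop out because their coefficient is zero, while the remaining pairs are kept with a positive or negative sign.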

The conditional distributions are the probabilities of the features, given 
a certain instance of a class. The joint probability functions of correlated variables 
are shown below to allow a more in-depth look into the dependency.

In [9]:
reload(descr)

descr.correlation_threshold = 0
descr.feature_combinations = 'all'
descr.show_conditional = True

descr.categorical_correlations(spn, dictionary)

8


There is a weak  dependency between "Pclass" and "Survived".

The model shows a weak  relationship for the features "Gender" and "Pclass".

"Age" and "Survived" have a weak  interdependency.

The features "Age" and "Pclass" have a weak  relationship.

"Age" and "Gender" have a weak  dependency.

The features "SibSp" and "Survived" have a weak  dependency.

There is a weak  relationship between the features "SibSp" and "Pclass".

The features "SibSp" and "Gender" have a weak  relationship.

"SibSp" and "Age" have a weak  dependency.

The model shows a weak  relationship for "Parch" and "Survived".

There is a weak  interdependency between "Parch" and "Pclass".

The features "Parch" and "Gender" have a weak  dependency.

"Parch" and "Age" have a weak  interdependency.

There is a weak  relationship for the features "Parch" and "SibSp".

The features "Fare" and "Survived" have a weak  relation.

"Fare" and "Pclass" have a weak  relation.

There is a weak  interdependency for "Fare" and "Gender".

The features "Fare" and "Age" have a weak  dependency.

The model shows a weak  dependency for the features "Fare" and "SibSp".

"Fare" and "Parch" have a weak  dependency.

The model shows a weak  relationship for "Embarked" and "Pclass".

There is a weak  dependency between "Embarked" and "Age".

The model shows a weak  interdependency between "Embarked" and "SibSp".

The model shows a weak  relationship for "Embarked" and "Parch".

There is a weak  dependency between "Embarked" and "Fare".
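
A conditional distribution can be obtained from a joint distribution by dividing by the probability of the evidence. The sketch below shows this for a made-up joint table. The feature values and the helper `conditional` are hypothetical and independent of the notebook's own functions.

```python
# Hypothetical joint distribution P(group, outcome) with made-up numbers.
joint = {
    ("g1", "yes"): 0.25, ("g1", "no"): 0.10,
    ("g2", "yes"): 0.15, ("g2", "no"): 0.50,
}


def conditional(joint_probs, given_group):
    """Return P(outcome | group = given_group) as joint divided by evidence."""
    evidence = sum(p for (group, _), p in joint_probs.items() if group == given_group)
    return {
        outcome: p / evidence
        for (group, outcome), p in joint_probs.items()
        if group == given_group
    }


outcome_given_g1 = conditional(joint, "g1")
outcome_given_g2 = conditional(joint, "g2")
print(outcome_given_g1)
print(outcome_given_g2)
```

If the two conditional distributions differ noticeably, as they do here, the features are dependent.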

---

## Cluster evaluation

To give an impression of the data representation as a whole, the complete network graph is 
shown below. The model is a tree, with a sum node at its center. The root of the tree is shown 
in white, while the sum and product nodes are green and blue respectively. Finally, all 
leaves are represented by red nodes.

In [10]:
#p.plot_graph(spn=spn, fname='deep_notebooks/images/graph.png', context=context)
#display(Image(filename='deep_notebooks/images/graph.png', width=400))

The data model provides a clustering of the data points into groups in which features are 
independent. The groups extracted from the data are outlined below together with a short 
description of the data they cover. Each branch in the model represents one cluster found 
in the data model.
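
The clustering described above can be sketched as a weighted mixture: the root sum node mixes the clusters, and inside each cluster the features are treated as independent, so their densities multiply. The example below is a toy model with Gaussian leaves chosen for brevity. The notebook's model uses histogram and piecewise linear leaves, and all weights and parameters here are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution, used here as a stand-in leaf."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical two-cluster model over two features; weights sum to one.
clusters = [
    {"weight": 0.8, "leaves": [(0.0, 1.0), (5.0, 2.0)]},
    {"weight": 0.2, "leaves": [(3.0, 0.5), (1.0, 1.0)]},
]


def product_node(point, leaves):
    """Features are independent inside a cluster, so leaf densities multiply."""
    return math.prod(gaussian_pdf(v, mu, s) for v, (mu, s) in zip(point, leaves))


def sum_node(point, model):
    """The root mixes the clusters according to their weights."""
    return sum(c["weight"] * product_node(point, c["leaves"]) for c in model)


print(sum_node([0.0, 5.0], clusters))
print(sum_node([3.0, 1.0], clusters))
```

In this reading, the weight of a cluster corresponds to the share of the data it represents.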

### Description of all clusters

In [11]:
# possible values: 'all', 'big', int (leading to a random sample), list of nodes to be displayed
nodes = f.get_sorted_nodes(spn)

reload(descr)
descr.nodes = 'all'
descr.show_node_graphs = False

descr.node_introduction(spn, nodes, context)

The SPN contains 5 clusters.


These are:

- a deep Product Node, representing 79.63% of the data.
  - The node has 2 children and 381 descendants, resulting in a remaining depth of 7.


- a deep Product Node, representing 7.02% of the data.
  - The node has 8 children and 24 descendants, resulting in a remaining depth of 3.


- a deep Product Node, representing 5.34% of the data.
  - The node has 8 children and 24 descendants, resulting in a remaining depth of 3.


- a deep Product Node, representing 4.78% of the data.
  - The node has 8 children and 24 descendants, resulting in a remaining depth of 3.


- a deep Product Node, representing 3.23% of the data.
  - The node has 8 children and 24 descendants, resulting in a remaining depth of 3.


The node representatives are the most likely data points for each node. They are archetypal for what the node represents and what subgroup of the data it encapsulates.

As stated above, each cluster captures a subgroup of the data. To show what variables are 
captured by which cluster, the means and variances for each feature and subgroup are plotted below. 
This highlights where the node has its focus.

In [12]:
descr.features_shown = 'all'
descr.mean_threshold = 0.1
descr.variance_threshold = 0.1
descr.separation_threshold = 0.1

separations = descr.show_node_separation(spn, nodes, context)

The feature "Pclass" is weakly separated by the clustering. The variances of the nodes 0, 1 are significantly larger than for the average node. The mean of node 0 is significantly larger than for the average node. The means of the nodes 1, 2, 3, 4 are significantly smaller than for the average node.

The feature "Age" is weakly separated by the clustering. The variances of the nodes 0, 1, 2 are significantly larger than for the average node. The means of the nodes 1, 4 are significantly larger than for the average node. The mean of node 3 is significantly smaller than for the average node.

The feature "SibSp" is weakly separated by the clustering. The variances of the nodes 0, 1 are significantly larger than for the average node. The means of the nodes 1, 4 are significantly larger than for the average node. The mean of node 3 is significantly smaller than for the average node.

The feature "Parch" is weakly separated by the clustering. The variances of the nodes 0, 1 are significantly larger than for the average node. The means of the nodes 1, 3, 4 are significantly smaller than for the average node.

The feature "Fare" is weakly separated by the clustering. The variances of the nodes 0, 2 are significantly larger than for the average node. The means of the nodes 1, 2, 3, 4 are significantly larger than for the average node. The mean of node 0 is significantly smaller than for the average node.

An analysis of the distribution of categorical variables is given below. If a cluster or a group of clusters 
captures a large fraction of the total likelihood of a categorical instance, it can be 
interpreted to represent this instance and the associated distribution.

In [13]:
reload(descr)

descr.categoricals = 'all'

descr.node_categorical_description(spn, dictionary)

#### Distribution of Survived

87.46% of "['no']" is captured by the nodes 1. The probability of "['no']" for this group of nodes is 57.0%

100.0% of "['yes']" is captured by the nodes 0, 1, 2, 3, 4. The probability of "['yes']" for this group of nodes is 50.0%

#### Distribution of Gender

100.0% of "['female']" is captured by the nodes 0, 1, 2, 3, 4. The probability of "['female']" for this group of nodes is 50.0%

91.39% of "['male']" is captured by the nodes 1, 2. The probability of "['male']" for this group of nodes is 53.23%

#### Distribution of Embarked

The feature "Embarked" is not separated well along the primary clusters.
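
One way to read the phrase "fraction of the total likelihood captured by a cluster" is as the cluster weight times the probability of the value inside that cluster, divided by the overall probability of the value. The sketch below follows this reading with made-up numbers. It is an assumption about the idea, not a description of `descr.node_categorical_description`.

```python
# Hypothetical cluster weights and per-cluster probabilities of one
# categorical value; none of these numbers come from the report above.
weights = [0.6, 0.3, 0.1]
prob_value_given_cluster = [0.9, 0.2, 0.1]

# Joint probability of "cluster k and value", then the overall probability.
joint = [w * p for w, p in zip(weights, prob_value_given_cluster)]
total = sum(joint)

# Share of the value's total likelihood that each cluster captures.
shares = [j / total for j in joint]

for index, share in enumerate(shares):
    print(f"cluster {index}: {share:.2%}")
```

A cluster with a large share can be interpreted as representing that categorical value.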

### Correlations by cluster

Finally, since each node captures different interactions between the features, it is 
interesting to look at the correlations again, this time for the separate nodes. Shallow 
nodes are omitted, because the correlation of independent variables is always 0.

In [14]:
reload(descr)

descr.correlation_threshold = 0.1
descr.nodes = 'all'

descr.node_correlation(spn, dictionary)

### Correlations for node 0

8


No features show more than a very weak correlation.

### Correlations for node 1

8


The model shows a weak positive relation between "Age" and "Survived". "Age" and "Pclass" influence each other weakly. As one increases, the other decreases. "Parch" and "SibSp" influence each other weakly. As one increases, the other increases.

All other features do not have more than a very weak correlation.

### Correlations for node 2

8


No features show more than a very weak correlation.

### Correlations for node 3

8


No features show more than a very weak correlation.

### Correlations for node 4

8


No features show more than a very weak correlation.

---
## Predictive data analysis

In [15]:
reload(util)
numerical_data, categorical_data = util.get_categorical_data(spn, df, dictionary)

After the cluster description, the data model is used to predict data points. To evaluate 
the performance of the model, the misclassification rate is shown below.

The classified data points are used to analyze more advanced patterns within the data, by looking
first at the misclassified points, and then at the classification results in total.

In [16]:
descr.classify = 'all'

misclassified, data_dict = descr.classification(spn, numerical_data, dictionary)

0


For feature "Survived" the SPN misclassifies 247 instances, resulting in a precision of 65.31%.

2


For feature "Gender" the SPN misclassifies 244 instances, resulting in a precision of 65.73%.

7


For feature "Embarked" the SPN misclassifies 158 instances, resulting in a precision of 77.81%.
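
The relation between the number of misclassified instances and the reported percentage can be sketched as follows. The labels are toy values, and the sketch assumes that the percentage in the report is the share of correctly classified instances, which the notebook does not state explicitly.

```python
# Hypothetical true and predicted labels for eight instances.
true_labels = ["no", "yes", "no", "no", "yes", "yes", "no", "yes"]
predicted = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]

# Indices of the instances where the prediction differs from the truth.
misclassified_indices = [
    i for i, (truth, guess) in enumerate(zip(true_labels, predicted)) if truth != guess
]

misclassification_rate = len(misclassified_indices) / len(true_labels)
share_correct = 1.0 - misclassification_rate

print(misclassified_indices)
print(f"misclassified: {len(misclassified_indices)}, correct: {share_correct:.2%}")
```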

Below, the misclassified examples are explained using the clusters they are most associated with.
For each instance, those clusters which form 90% of the prediction are reported together with the
representatives of these clusters.

In [17]:
# IMPORTANT: Only set use_shapley to true if you have a really powerful machine
reload(descr)
reload(p)

descr.use_shapley = False
descr.shapley_sample_size = 1
descr.misclassified_explanations = 1

descr.describe_misclassified(spn, dictionary, misclassified, data_dict, numerical_data)

Instance 122 was predicted as "['yes']", even though it is "['no']", because it was most similar to the following clusters: 1, 2, 0

Instance 578 was predicted as "['male']", even though it is "['female']", because it was most similar to the following clusters: 1, 2, 0

Instance 673 was predicted as "['S']", even though it is "['C']", because it was most similar to the following clusters: 1, 2, 0
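
Selecting the clusters that together form 90% of a prediction can be sketched as a cumulative sum over the sorted contributions. The function `top_clusters` and the contribution values are hypothetical. How the notebook measures the contribution of a cluster is not shown here.

```python
def top_clusters(contributions, mass=0.9):
    """Return the largest contributors until their share reaches `mass`."""
    total = sum(contributions.values())
    selected, covered = [], 0.0
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    for cluster, value in ranked:
        selected.append(cluster)
        covered += value / total
        if covered >= mass:
            break
    return selected


# Hypothetical contribution of each cluster to one prediction.
toy_contributions = {0: 0.10, 1: 0.55, 2: 0.30, 3: 0.05}
selected_clusters = top_clusters(toy_contributions)
print(selected_clusters)
```

The clusters are listed in order of decreasing contribution, so the first entry is the one the instance is most similar to.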

### Information gain through features

The following graphs highlight the relative importance of different features for a 
classification. They can show how different classes are predicted. For continuous and
discrete features, a high positive or negative importance shows that changing this feature's
value in the positive or negative direction increases the certainty of the prediction.

For categorical values, positive and negative values highlight whether changing or keeping
this categorical value increases or decreases the predictive certainty.
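
For continuous features, one common way to obtain such a signed importance is the local gradient of the class probability with respect to the features. The sketch below approximates this gradient by finite differences for a toy logistic classifier. Both the classifier and the assumption that the notebook's importance values are gradients of this kind are hypothetical.

```python
import math


def class_probability(point):
    """Hypothetical classifier: class probability as a logistic function."""
    first, second = point
    return 1.0 / (1.0 + math.exp(-(0.05 * first - 0.02 * second)))


def explanation_vector(prob_fn, point, eps=1e-5):
    """Central finite-difference gradient of the class probability."""
    gradient = []
    for i in range(len(point)):
        upper = list(point)
        lower = list(point)
        upper[i] += eps
        lower[i] -= eps
        gradient.append((prob_fn(upper) - prob_fn(lower)) / (2 * eps))
    return gradient


vector = explanation_vector(class_probability, [30.0, 40.0])
print(vector)
```

A positive entry means that increasing the feature raises the class probability at this point, while a negative entry means that it lowers it.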

In [None]:
reload(descr)
reload(p)

descr.explanation_vector_threshold = 0
descr.explanation_vector_classes = None
descr.explanation_vectors_show = 'all'

expl_vectors = descr.explanation_vector_description(spn, dictionary, data_dict, categoricals)

#### Class "Survived": "['no']"

##### Predictive feature "Pclass"



The feature "Pclass" is a very weak negative predictor for the instance "['no']" of class "Survived".
Generally, a higher value for this feature will increase the class probability. 

 The feature influences the classification {strength_adv} for data points centered around {mean}.

No fitting instances found.


##### Predictive categorical feature "Gender": "['male']"



##### Predictive feature "Age"



The feature "Age" is a very weak negative predictor for the instance "['no']" of class "Survived".
The feature influences the classification weakly for data points centered around 21.72 and weakly for points centered around 35.61.

##### Predictive feature "SibSp"



The feature "SibSp" is a very weak negative predictor for the instance "['no']" of class "Survived".
For data points centered around 0.32 the feature influences the classification weakly. The influence of the feature on the classification is weak for data points centered around 0.35.

##### Predictive feature "Parch"



The feature "Parch" is a weak positive predictor for the instance "['no']" of class "Survived".
The classification is weakly influenced by the feature for data points centered around 0.57. The impact on the classification is weak for data points centered around 0.07. For data points centered around 0.15 the influence on the classification is very strong.

##### Predictive feature "Fare"



The feature "Fare" is a weak positive predictor for the instance "['no']" of class "Survived".
For data points centered around 78.0 the influence on the classification is weak. The impact on the classification is weak for data points centered around 13.95. The classification is strongly influenced for data points centered around 59.48. The impact on the classification is very strong for data points centered around 57.98.

##### Predictive categorical feature "Embarked": "['S']"



#### Class "Survived": "['yes']"

##### Predictive feature "Pclass"



The feature "Pclass" is a very weak negative predictor for the instance "['yes']" of class "Survived".
Generally, a higher value for this feature will increase the class probability. 

 The feature influences the classification {strength_adv} for data points centered around {mean}.

##### Predictive categorical feature "Gender": "['male']"



##### Predictive feature "Age"



The feature "Age" is a very weak positive predictor for the instance "['yes']" of class "Survived".
The feature influences the classification weakly for data points centered around 20.05. The feature influences the classification weakly for data points centered around 29.71.

##### Predictive feature "SibSp"



The feature "SibSp" is a very weak positive predictor for the instance "['yes']" of class "Survived".
For data points centered around 0.69 the impact on the classification is weak. The feature influences the classification weakly for data points centered around 0.61.

##### Predictive feature "Parch"



The feature "Parch" is a moderate positive predictor for the instance "['yes']" of class "Survived".
For data points centered around 0.62 the classification is very strongly influenced. The feature influences the classification weakly for data points centered around 1.06 but weakly for points centered around 0.45. The impact of it on the classification is moderate for data points centered around 0.1. The classification is very strongly influenced for data points centered around 0.58.

##### Predictive feature "Fare"



The feature "Fare" is a strong positive predictor for the instance "['yes']" of class "Survived".
For data points centered around 53.03 the classification is weakly influenced by the feature. The influence on the classification is weak for data points centered around 38.76. The feature influences the classification moderately for data points centered around 40.99. For data points centered around 69.63 the feature influences the classification strongly and very strongly for points centered around 28.75.

##### Predictive categorical feature "Embarked": "['S']"



#### Class "Gender": "['female']"

##### Predictive categorical feature "Survived": "['yes']"




divide by zero encountered in log



##### Predictive feature "Pclass"



The feature "Pclass" is a moderate negative predictor for the instance "['female']" of class "Gender".
For data points centered around 3.0 the feature influences the classification very strongly but weakly for points centered around 1.0. For data points centered around 1.0 the classification is strongly influenced by it. For data points centered around 1.0 the feature influences the classification very strongly.

##### Predictive feature "Age"



The feature "Age" is a very weak negative predictor for the instance "['female']" of class "Gender".
Generally, a higher value for this feature will increase the class probability. 

 The influence on the classification is {strength} for data points centered around {mean}.

##### Predictive feature "SibSp"



The feature "SibSp" is a very weak positive predictor for the instance "['female']" of class "Gender".
The classification is weakly influenced for data points centered around 0.91. For data points centered around 0.8 the feature influences the classification weakly.

##### Predictive feature "Parch"



The feature "Parch" is a weak positive predictor for the instance "['female']" of class "Gender".
For data points centered around 1.0 the influence on the classification is weak. For data points centered around 1.0 the classification is weakly influenced but very strongly for points centered around 1.0.

##### Predictive feature "Fare"



The feature "Fare" is a very weak positive predictor for the instance "['female']" of class "Gender".
The impact of it on the classification is weak for data points centered around 124.34. The classification is weakly influenced for data points centered around 47.75.

##### Predictive categorical feature "Embarked": "['S']"



#### Class "Gender": "['male']"

##### Predictive categorical feature "Survived": "['yes']"



---

## Conclusion

In [40]:
reload(descr)
descr.print_conclusion(spn, dictionary, corr, nodes, separations, expl_vectors)

This concludes the automated report on the SumNode_0 dataset.

The initial findings show that the following variables have significant connections with each other.



The initial clustering performed by the algorithm separates the following features well:





If you want to explore the dataset further, you can use the interactive notebook to add your own queries to the network.
    

## Dive into the data

Use the Facets Interface to visualize data on your own. You can either load the dataset itself, or show the data as predicted by the model.

In [41]:
# Load the dataset and convert it to JSON for the visualization
import pandas as pd
df = pd.read_csv(dataset)
jsonstr = df.to_json(orient='records')

# Display the Dive visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

## Build your own queries

This notebook enables you to add your own analysis to the above. Maybe you are interested in drilling down into specific subclusters of the data, or you want to predict additional data points not represented in the training data.

In [42]:
from spn.algorithms.Inference import likelihood

# get a sample to predict
data_point = numerical_data[1:2]
# get the probability from the model's joint probability function
proba = likelihood(spn, data_point)


printmd(data_point)
printmd(proba)

[[ 0.      3.      1.      7.      4.      1.     39.6875  2.    ]]

[[9.35769131e-13]]

You can also predict the probability of several data points at the same time.

In [43]:
data_point = numerical_data[0:3]
proba = likelihood(spn, data_point)

printmd(data_point)
printmd(proba)

[[ 0.      3.      0.     18.      1.      0.     17.8     2.    ]
 [ 0.      3.      1.      7.      4.      1.     39.6875  2.    ]
 [ 0.      3.      1.     21.      0.      0.      7.8     2.    ]]

[[8.34493582e-13]
 [9.35769131e-13]
 [4.92877600e-14]]
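
Because joint likelihoods over many features are very small numbers, it is often easier to compare them on a logarithmic scale. The following sketch uses only the three likelihood values printed above and the standard library. The variable names are chosen for this example.

```python
import math

# Likelihoods copied from the output above.
likelihoods = [8.34493582e-13, 9.35769131e-13, 4.92877600e-14]

# Log-likelihoods are easier to read and can be summed without underflow.
log_likelihoods = [math.log(value) for value in likelihoods]
total_log_likelihood = sum(log_likelihoods)

# Index of the data point the model considers most likely.
most_likely = max(range(len(likelihoods)), key=likelihoods.__getitem__)

for index, value in enumerate(log_likelihoods):
    print(f"data point {index}: log-likelihood {value:.2f}")
print("most likely data point:", most_likely)
```

The sum of the log-likelihoods corresponds to the logarithm of the product of the individual likelihoods, which treats the data points as independent draws from the model.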