# Module 2B (Part 2): Bayesian Networks

A Bayesian network (BN) is composed of random variables (nodes) and their conditional dependencies (arcs) which, together, form a directed acyclic graph (DAG). BNs have become a widely used method for modelling uncertain knowledge. A conditional probability table (CPT) is associated with each node. It contains the conditional probability distribution of the node given its parents in the DAG:

<img src='images/Wetgrass.png' style='width: 450px;' />

Each node represents a random variable, which is described by a probability distribution conditioned on its parent nodes. The biggest advantage of a Bayesian network is its compact and modular structure. Humans do not have access to all the probability distributions and all the variables of the world. To make probabilistic inferences, they therefore need to combine different sources of evidence to come up with an answer. This is precisely what Bayesian networks do, using probabilistic formulas that build on the Naive Bayes model that we just saw. We will not go through the full mathematics here. Because they are graphical structures, Bayesian networks can also be used by non-experts in everyday decision-making tasks.
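
The compactness comes from the chain-rule factorization that every Bayesian network encodes. It is stated here for reference as a standard identity, with $Pa(X_i)$ (a notation introduced only for this formula) denoting the parents of $X_i$ in the DAG:

$$Pr( X_1, \dots, X_n ) = \prod_{i=1}^{n} Pr( X_i~|~Pa(X_i) )$$

Each factor corresponds to exactly one CPT. For $n$ binary variables whose nodes have at most $k$ parents, the network therefore needs at most $n \cdot 2^k$ independent probabilities, instead of the $2^n - 1$ required by an unrestricted joint distribution.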

## Creating a Bayesian Network in Python

Consider the Bayesian network below, which describes the following decision scenario.

This network tries to express the probability of a person having either Tuberculosis, Lung Cancer or Bronchitis, given a symptom (shortness of breath, called Dispnea in this notebook), an exam (a positive X-ray result) and some historical information (visits to Asia and smoking).


<img src='images/asia.png' style='width: 450px;' />

The random variables of this network are:
* Visit to Asia
* Tuberculosis
* Either Tuberculosis or Lung Cancer
* Positive X-Ray
* Dispnea
* Bronchitis
* Smoker

**QUESTION. Which of the above random variables are root nodes of the network? (A root node is a node that has no parents, i.e. it does not descend from any other node.)**

**Answer:** 
???


### Create the Network Structure in Python

We start by importing the libraries needed in this notebook:

In [None]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import os

#Python library that deals with Bayesian Networks (BNs)
import pyAgrum as bn_graphs
import pyAgrum.lib.notebook as gnb

from pyAgrum.lib.bn2roc import showROC

import seaborn as sns
sns.set()


Next, we define our Bayesian Network. As you can see, it is an empty structure (for now...):

In [None]:
bn = bn_graphs.BayesNet('CancerBN') #Creates an empty network called CancerBN
print(bn)                           #Prints the created BN


The above code builds a general network structure, but with no nodes or edges or conditional probability tables.
Our next step will be precisely to specify these variables.

### Create the Random Variables (the nodes)

To create a random variable, we use pyAgrum's class *LabelizedVariable*, which represents a variable whose domain is a finite set of labels. You can do it in the following way:

In [None]:
#LabelizedVariable(id_name, label, cardinality) receives the following arguments:
#id_name: a string that uniquely identifies the node
#label: a string that describes the node (here we reuse the name)
#cardinality: an integer which specifies the number of different values that
#             the random variable will have. We set this to 0 for now and
#             add the labels afterwards

id_name = 'LungCancer'
label = 'LungCancer'
LungCancer = bn_graphs.LabelizedVariable(id_name, label, 0)
print(LungCancer)

We can now specify the labels "present" and "absent" of our random variable. For this, we use the method *addLabel()*:

In [None]:
#In our example, the random variable 'LungCancer' can take the values
LungCancer.addLabel('present')   #'present' if LungCancer occurred or
LungCancer.addLabel('absent')    #'absent' if LungCancer did not occur.
print(LungCancer)

We can now add the created random variable to our network by using the method .add()

In [None]:
bn.add( LungCancer )

The remaining random variables are created in the same way, this time with a loop over their names:

In [None]:
#Create a list with the names of the nodes
nodes_lst = ['VisitAsia','Smoker', 'Tuberculosis', 'Bronchitis', 'Dispnea', 'PositiveXray', 'TubercOrLungCan']
print(nodes_lst)


In [None]:
#Creates a new node for each of the random variables in nodes_lst

# node is a variable that goes through each entry of the list nodes_lst
# the iterations are performed by the *for* loop in the following way:
# node = 'VisitAsia' .............. iteration #1
# node = 'Smoker' ................. iteration #2
# node = 'Tuberculosis' ........... iteration #3
# node = 'Bronchitis' ............. iteration #4
# node = 'Dispnea' ................ iteration #5
# node = 'PositiveXray' ........... iteration #6
# node = 'TubercOrLungCan' ........ iteration #7
for node in nodes_lst:
    print(node)
    var = bn_graphs.LabelizedVariable(node, node, 0)  #creates random variable
    var.addLabel('present')                           #adds the label 'present'
    var.addLabel('absent')                            #adds the label 'absent'
    bn.add(var)                                       #adds the created var to the network

print(bn)


### Create the Edges (the arcs between nodes)

Now that we have defined our nodes, we need to define the arcs between them. For this, we use the method addArc( sourceNode, targetNode ):

In [None]:
# Arc from VisitAsia (source) to Tuberculosis (target)
bn.addArc('VisitAsia', 'Tuberculosis')
print(bn)


In [None]:
arc_lst = [ ('Tuberculosis', 'TubercOrLungCan'), ('LungCancer', 'TubercOrLungCan'), ('Smoker','LungCancer' ), ('Smoker', 'Bronchitis'), ('Bronchitis', 'Dispnea'), ('TubercOrLungCan', 'Dispnea'), ('TubercOrLungCan', 'PositiveXray' ) ]
print( arc_lst )


In [None]:
#Creates a new arc for each of the pairs in arc_lst

# arc is a variable that goes through each entry of the list arc_lst
# the iterations are performed by the *for* loop in the following way:
# arc = ('Tuberculosis', 'TubercOrLungCan') ...... iteration #1
# arc = ('LungCancer', 'TubercOrLungCan') ........ iteration #2
# arc = ('Smoker', 'LungCancer') ................. iteration #3
# arc = ('Smoker', 'Bronchitis') ................. iteration #4
# arc = ('Bronchitis', 'Dispnea') ................ iteration #5
# arc = ('TubercOrLungCan', 'Dispnea') ........... iteration #6
# arc = ('TubercOrLungCan', 'PositiveXray') ...... iteration #7
for arc in arc_lst:
    bn.addArc( arc[0],  arc[1] )      #adds the arc (source, target) to the network

print(bn)
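
Since a Bayesian network must be a directed acyclic graph, it can be useful to check an arc list before building the network. The following is a minimal sketch using only Python's standard library (`graphlib`, available from Python 3.9); it is not part of pyAgrum, and the names `arcs` and `topological_order` are introduced here for illustration only.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+

# Arcs of the Asia network as (parent, child) pairs, copied from the cells above
arcs = [
    ('VisitAsia', 'Tuberculosis'),
    ('Tuberculosis', 'TubercOrLungCan'),
    ('LungCancer', 'TubercOrLungCan'),
    ('Smoker', 'LungCancer'),
    ('Smoker', 'Bronchitis'),
    ('Bronchitis', 'Dispnea'),
    ('TubercOrLungCan', 'Dispnea'),
    ('TubercOrLungCan', 'PositiveXray'),
]

def topological_order(arc_list):
    """Return the nodes with every parent before its children, or None if the arcs contain a cycle."""
    parents = {}
    for parent, child in arc_list:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    try:
        # TopologicalSorter expects a mapping: node -> set of predecessors
        return list(TopologicalSorter(parents).static_order())
    except CycleError:
        return None

order = topological_order(arcs)
print(order)
```

If the function returns a list, the arcs form a DAG, and the list gives one order in which the CPTs could be filled, parents first.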

### Display your Network

In [None]:
bn

### Define the Conditional Probability Tables

Once we have the structure of the network, we need to specify the conditional probability tables (CPTs). In pyAgrum, each CPT is stored as a *Potential*.

There are several ways to fill these CPTs. In this workshop, we will show you some of them.

### Low Level Method

Filling the conditional probability table of the root node VisitAsia:

In [None]:
#Fill the conditional probability table of the variable 
#VisitAsia according to Figure 1: Pr(VisitAsia=present)  = 0.01
#                                 Pr(VisitAsia=absent) = 1 - 0.01 = 0.99
bn.cpt('VisitAsia').fillWith( [0.01, 1-0.01] )

Filling the conditional probability table of the root node Smoker:

In [None]:
# Fill the conditional probability table of the variable 
# Smoker according to Figure 1: Pr(Smoker=present)  = 0.5
#                               Pr(Smoker=absent) = 1 - 0.5 = 0.5
bn.cpt('Smoker').fillWith( [0.5, 1-0.5] )

The most convenient way to fill conditional probability tables is by using Python dictionaries. This is done in the following way for the variable TubercOrLungCan:

In [None]:
bn.cpt( 'TubercOrLungCan' )[ {'LungCancer': 'present',  'Tuberculosis': 'present'}  ] = [1, 0]
bn.cpt( 'TubercOrLungCan' )[ {'LungCancer': 'present',  'Tuberculosis': 'absent'} ] = [1, 0]
bn.cpt( 'TubercOrLungCan' )[ {'LungCancer': 'absent', 'Tuberculosis': 'present'}  ] = [1, 0]
bn.cpt( 'TubercOrLungCan' )[ {'LungCancer': 'absent', 'Tuberculosis': 'absent'} ] = [0, 1]

bn.cpt('TubercOrLungCan')
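
The table above is a deterministic OR of the two parents. As a sketch of how such rows could be generated instead of typed by hand, the plain-Python snippet below enumerates the parent configurations with `itertools.product`. It does not call pyAgrum; the names `states` and `or_cpt` are hypothetical.

```python
from itertools import product

states = ['present', 'absent']

# Build the rows of a deterministic OR: the child is 'present' with probability 1
# as soon as at least one parent is 'present'
or_cpt = {}
for lung_cancer, tuberculosis in product(states, repeat=2):
    either = (lung_cancer == 'present') or (tuberculosis == 'present')
    # Each row is [Pr(child = present), Pr(child = absent)]
    or_cpt[(lung_cancer, tuberculosis)] = [1, 0] if either else [0, 1]

for parent_states, row in or_cpt.items():
    print(parent_states, row)
```

Each generated row could then be written into the network with the same dictionary syntax used above, one parent configuration at a time.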


**Try it yourself!** Can you write down the conditional probability table for the node Tuberculosis according to the probabilities in Figure 1?

In [None]:
#Tuberculosis
bn.cpt( 'Tuberculosis' )[ {'VisitAsia': 'present'}  ] = [ 0.05, 1 - 0.05  ]
bn.cpt( 'Tuberculosis' )[ {'VisitAsia': 'absent'} ] = [ 0.01, 1 - 0.01 ]

bn.cpt('Tuberculosis')


**Try it yourself!** Can you write down the conditional probability table for the node LungCancer according to the probabilities in Figure 1?

In [None]:
#LungCancer
bn.cpt( 'LungCancer' )[ {'Smoker': 'present'}  ] = [ 0.1, 1 - 0.1  ]
bn.cpt( 'LungCancer' )[ {'Smoker': 'absent'} ] = [ 0.01, 1 - 0.01 ]

bn.cpt('LungCancer')


In [None]:
#Bronchitis
bn.cpt( 'Bronchitis' )[ {'Smoker': 'present'}  ] = [ 0.6, 1 - 0.6  ]
bn.cpt( 'Bronchitis' )[ {'Smoker': 'absent'} ] = [ 0.3, 1 - 0.3 ]

bn.cpt('Bronchitis')

In [None]:
bn.cpt( 'Dispnea' )[ {'Bronchitis': 'present',  'TubercOrLungCan': 'present'}  ] = [0.9, 1-0.9]
bn.cpt( 'Dispnea' )[ {'Bronchitis': 'present',  'TubercOrLungCan': 'absent'} ] = [1, 0]
bn.cpt( 'Dispnea' )[ {'Bronchitis': 'absent', 'TubercOrLungCan': 'present'}  ] = [0.7, 1-0.7]
bn.cpt( 'Dispnea' )[ {'Bronchitis': 'absent', 'TubercOrLungCan': 'absent'} ] = [0.8, 1-0.8]

bn.cpt('Dispnea')

In [None]:
# PositiveXray
bn.cpt( 'PositiveXray' )[ {'TubercOrLungCan': 'present'}  ] = [ 0.6, 1 - 0.6  ]
bn.cpt( 'PositiveXray' )[ {'TubercOrLungCan': 'absent'} ] = [ 0.3, 1 - 0.3 ]

bn.cpt('PositiveXray')

In [None]:
gnb.showInference( bn )

## Saving your Network

Well done! Your network is now complete! We can now save it in different formats. In this unit, we will use the format *.net* because it is widely used in the scientific community.

In [None]:
import os

bn_graphs.saveBN( bn, os.path.join('data', 'Asia.net'))

To open the saved file:

In [None]:
bn_saved = bn_graphs.loadBN(os.path.join('data','Asia.net'))

In [None]:
bn_saved

## Inferences in Bayesian Networks

Probabilistic inference is the task of deriving the probability of one or more random variables taking a specific value or a specific set of values. For instance, we can use the Bayesian Network to *infer* the probability of Lung Cancer being present given that a person smokes:

$$Pr( LungCancer = present | Smoker = present ) =~?$$

To do this, we need to choose an algorithm to perform probabilistic inferences. There are two ways to accomplish this in pyAgrum:
- An exact method: **LazyPropagation**, which is usually applied for small networks
- An approximate method: **Gibbs**, which is usually applied for large networks.
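
To give a feel for what "approximate" means, here is a minimal sketch that estimates $Pr( Tuberculosis = present )$ by drawing random samples from the two CPTs filled earlier. It is plain forward sampling written for this illustration: it is not the Gibbs sampler and it is not pyAgrum code. The function name, sample size and seed are arbitrary choices.

```python
import random

def estimate_tuberculosis(n_samples=200_000, seed=0):
    """Estimate Pr(Tuberculosis = present) by forward sampling (parent first, then child)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Sample the root first: Pr(VisitAsia = present) = 0.01
        visit_asia = rng.random() < 0.01
        # Then sample the child given its parent: 0.05 if the person visited Asia, else 0.01
        p_tuberculosis = 0.05 if visit_asia else 0.01
        if rng.random() < p_tuberculosis:
            hits += 1
    return hits / n_samples

estimate = estimate_tuberculosis()
exact = 0.01 * 0.05 + 0.99 * 0.01   # exact marginal, for comparison
print(round(estimate, 4), round(exact, 4))
```

The estimate fluctuates around the exact value and gets closer as the number of samples grows, which is the trade-off that approximate methods make on large networks.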

In this unit, we will apply exact probabilistic methods, so we will use the **LazyPropagation** method. We can use it in the following way:

In [None]:
inference = bn_graphs.LazyPropagation(bn_saved)
print(inference)

### Inference Without Evidence

Inferences without evidence are inferences in which you do not know anything about your decision scenario. All your variables are *unknown*. In other words, they are **not observed**. These are inferences of the type: what is the probability of a person having Dispnea?

$$Pr( Dispnea = present ) =~?$$

We do this in Python in the following way:

In [None]:
inference.makeInference()
inference.posterior('Dispnea')

This table gives the probability of Dispnea being present or absent when nothing else is known about the person, i.e. before any variable of the network has been observed.

If you want to access these values individually, in Python, you proceed like this:

In [None]:
# Pr( Dispnea = present)
pr_Dispnea = inference.posterior('Dispnea')[0]
print('Pr( Dispnea = present ) = ' + str(pr_Dispnea))

# You can round this number to 4 decimal places
print('Pr( Dispnea = present ) = ' + str(round(pr_Dispnea,4)))


In [None]:
gnb.showProba(inference.posterior('Dispnea'))
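
For a node with a single parent, the marginal returned by the inference engine can be checked by hand with the law of total probability. The sketch below does this for LungCancer in plain Python, using the CPT values entered earlier; the dictionary names are hypothetical and no pyAgrum call is involved.

```python
# CPT values copied from the cells above
pr_smoker = {'present': 0.5, 'absent': 0.5}                  # Pr(Smoker)
pr_cancer_given_smoker = {'present': 0.1, 'absent': 0.01}    # Pr(LungCancer = present | Smoker)

# Law of total probability: sum over every state of the unobserved parent
pr_cancer = sum(pr_smoker[s] * pr_cancer_given_smoker[s] for s in pr_smoker)
print(round(pr_cancer, 4))   # 0.055
```

Nodes with several ancestors, such as Dispnea, require summing over many more configurations, which is the work that the inference algorithm organizes efficiently.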

**TRY IT YOURSELF** 
Can you answer the following queries?
$$Pr( Bronchitis = present ) =~?$$
$$Pr( Tuberculosis = present ) =~?$$
$$Pr( VisitAsia = present ) =~?$$
$$Pr( PositiveXray = absent ) =~?$$

In [None]:
# Answer:
pr_Bronchitis_present = inference.posterior('Bronchitis')[0]
print('Pr( Bronchitis = present ) = ' + str(round(pr_Bronchitis_present,4)))

# index 0 is the label 'present', index 1 is the label 'absent'
pr_Tuberculosis_present = inference.posterior('Tuberculosis')[0]
print('Pr( Tuberculosis = present ) = ' + str(round(pr_Tuberculosis_present,4)))

pr_VisitAsia_present = inference.posterior('VisitAsia')[0]
print('Pr( VisitAsia = present ) = ' + str(round(pr_VisitAsia_present,4)))

pr_PositiveXray_absent = inference.posterior('PositiveXray')[1]
print('Pr( PositiveXray = absent ) = ' + str(round(pr_PositiveXray_absent,4)))

In [None]:
gnb.showProba(inference.posterior('Bronchitis'))
gnb.showProba(inference.posterior('Tuberculosis'))
gnb.showProba(inference.posterior('VisitAsia'))
gnb.showProba(inference.posterior('PositiveXray'))

### Inference with Evidence

Bayesian Networks also allow us to ask more complex questions (or queries). For instance, let's imagine that we know that a person recently visited Asia. What is now the probability of that person having tuberculosis given this additional piece of information (i.e. this piece of evidence)?

$$Pr( Tuberculosis = present~|~VisitAsia = present ) =~?$$

In [None]:
# When we observe that an event occurred, we have a piece of evidence to give to our network.
# We can specify this by using the function setEvidence() and by specifying the observed variable and its state:
inference.setEvidence({'VisitAsia':'present'})

# Then, we just make the inference as presented before and query the variable of interest
inference.makeInference()
inference.posterior('Tuberculosis')

In [None]:
gnb.showProba(inference.posterior('Tuberculosis'))

**Question** What happened to the probabilities? Knowing that a person went to Asia, what impact did this information have on, for instance, the probability of the person having Tuberculosis?

**Answer**
Before we observed that the person visited Asia, the probability of the person having tuberculosis was approximately:
$$Pr( Tuberculosis = present ) = 1\%$$

After observing that the person has been in Asia, the probability of Tuberculosis increased to:
$$Pr( Tuberculosis = present | VisitAsia = present ) = 5\%$$
This is a fivefold increase, but the probability remains small in absolute terms.
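
Evidence can also flow against the direction of the arcs. As a worked sketch, the plain-Python snippet below applies Bayes' rule to the diagnostic query $Pr( VisitAsia = present~|~Tuberculosis = present )$, using only the CPT values entered earlier. The variable names are hypothetical and no pyAgrum call is involved.

```python
pr_asia = 0.01                 # Pr(VisitAsia = present)
pr_tb_given_asia = 0.05        # Pr(Tuberculosis = present | VisitAsia = present)
pr_tb_given_no_asia = 0.01     # Pr(Tuberculosis = present | VisitAsia = absent)

# Denominator of Bayes' rule: marginal probability of the evidence
pr_tb = pr_asia * pr_tb_given_asia + (1 - pr_asia) * pr_tb_given_no_asia

# Bayes' rule: Pr(A | T) = Pr(T | A) * Pr(A) / Pr(T)
pr_asia_given_tb = pr_tb_given_asia * pr_asia / pr_tb
print(round(pr_tb, 4), round(pr_asia_given_tb, 4))
```

Observing tuberculosis raises the probability of a visit to Asia from 1% to roughly 4.8%, the same kind of update that `setEvidence()` followed by `posterior()` performs inside the network.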

**Try it yourself** Knowing that a person has been in Asia recently and is showing signs of shortness of breath (Dispnea), what happened to the probability distributions in the network?

In [None]:
# Answer:




### Visualizing All Inferences

pyAgrum also allows us to have a full visualization of the inferences for all variables:

In [None]:
# Showing the full network when no variables are observed
gnb.showInference( bn_saved )

In [None]:
# Showing the full network when we observe that Tuberculosis is present
gnb.showInference( bn_saved, inference, {'Tuberculosis':'present'} )


## Creating a Bayesian Network Using Existing Data - The Titanic Challenge

The conditional probability tables can be manually inserted into the Bayesian Network if we have this knowledge (which is usually acquired from experts and general statistics). However, most of the time we have a dataset and need to fill these conditional probability tables using that dataset. In this section, we will guide you on how to achieve this. Note that whether we fill the CPTs manually or learn them from existing data, the topology of the network must always be defined beforehand!

In this part of the notebook, we will show how one could use a Bayesian Network to model the Titanic dataset.

In [None]:
import pandas
import numpy as np   # np.nan is used by the pretreatment functions below
import os
import math
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
from pyAgrum.lib.bn2roc import showROC

This notebook presents three different Bayesian network techniques to answer the Kaggle Titanic challenge. In the first approach, we answer the challenge without using the training set, relying only on our prior knowledge about shipwrecks. In the second approach, we use only the training set with pyAgrum's machine learning algorithms. Finally, in the third approach, we use both prior knowledge about shipwrecks and machine learning.

### Pretreatment of Data
We will be using pandas to set up the learning data so that it fits pyAgrum's requirements.

In [None]:
traindf = pandas.read_csv(os.path.join('data', 'train.csv'))

testdf = pandas.merge(pandas.read_csv(os.path.join('data', 'test.csv')),
                    pandas.read_csv(os.path.join('data', 'gender_submission.csv')),
                    on="PassengerId")

This merges the test base with the information on whether each passenger survived or not.

In [None]:
traindf.var(numeric_only=True)   # variance of the numeric columns only

In [None]:
for k in traindf.keys():
    print('{0}: {1}'.format(k, len(traindf[k].unique())))

Looking at the number of unique values for each variable is necessary since Bayesian Networks are discrete models. We will want to reduce the domain size of some discrete variables (like Age) and discretize continuous variables (like Fare).

For starters, you can filter out variables with a large number of values. Keeping variables with many values has an impact on performance, which boils down to how much CPU and RAM you have at your disposal. Here, we choose to filter out any variable with more than 15 different outcomes.

In [None]:
for k in traindf.keys():
    if len(traindf[k].unique())<=15:
        print(k)

This leaves us with 6 variables, which is not much but still enough to learn a Bayesian Network. We will add just one more variable by reducing the cardinality of the Age variable.


In [None]:
testdf=pandas.merge(pandas.read_csv(os.path.join('data', 'test.csv')),
                    pandas.read_csv(os.path.join('data', 'gender_submission.csv')),
                    on="PassengerId")

def forAge(row):
    try:
        age = float(row['Age'])
        if age < 1:
            #return '[0;1['
            return 'baby'
        elif age < 6:
            #return '[1;6['
            return 'toddler'
        elif age < 12:
            #return '[6;12['
            return 'kid'
        elif age < 21:
            #return '[12;21['
            return 'teen'
        elif age < 80:
            #return '[21;80['
            return 'adult'
        else:
            #return '[80;200]'
            # Note: a missing age (NaN) fails every comparison above and also ends up here
            return 'old'
    except ValueError:
        return np.nan
    
def forBoolean(row, col):
    try:
        val = int(row[col])
        if row[col] >= 1:
            return "True"
        else:
            return "False"
    except ValueError:
        return "False"
    
def forGender(row):
    if row['Sex'] == "male":
        return "Male"
    else:
        return "Female"
        

testdf
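
The chain of `if`/`elif` tests in `forAge` can also be expressed as a table of cut points. The sketch below is one possible alternative using `bisect` from the standard library; the names `AGE_CUTS`, `AGE_LABELS` and `age_category` are hypothetical. Unlike `forAge`, where a missing age (NaN) fails every `<` comparison and falls through to 'old', this sketch returns `None` for missing values. That difference is a choice made here, not the notebook's behaviour.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of each age bracket, matching forAge above
AGE_CUTS = [1, 6, 12, 21, 80]
AGE_LABELS = ['baby', 'toddler', 'kid', 'teen', 'adult', 'old']

def age_category(age):
    """Map a numeric age to its bracket label; return None for a missing age."""
    if age is None or age != age:   # NaN is the only value that is not equal to itself
        return None
    # bisect_right counts how many cut points are <= age, which is the bracket index
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

print([age_category(a) for a in [0.5, 1, 5, 11.9, 21, 79, 80]])
```

Keeping the cut points in one list makes it easy to try a different discretization and compare the resulting networks.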

When pretreating data, you will want to wrap your changes inside a function. This will help you keep track of your changes and compare them easily.

In [None]:
def pretreat(df):
    if 'Survived' in df.columns:
        df['Survived'] = df.apply(lambda row: forBoolean(row, 'Survived'), axis=1).dropna()
    df['Age'] = df.apply(forAge, axis=1).dropna()
    df['SibSp'] = df.apply(lambda row: forBoolean(row, 'SibSp'), axis=1).dropna()
    df['Parch'] = df.apply(lambda row: forBoolean(row, 'Parch'), axis=1).dropna()
    df['Sex'] = df.apply(forGender, axis=1).dropna()
    droped_cols = [col for col in ['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'] if col in df.columns]
    df = df.drop(droped_cols, axis=1)
    df = df.rename(index=str, columns={'Sex': 'Gender', 'SibSp': 'Siblings', 'Parch': 'Parents'})
    return df

traindf = pandas.read_csv(os.path.join('data', 'train.csv'))
testdf  = pandas.merge(pandas.read_csv(os.path.join('data', 'test.csv')),
                       pandas.read_csv(os.path.join('data', 'gender_submission.csv')),
                       on="PassengerId")
traindf = pretreat(traindf)
testdf = pretreat(testdf)

We will need to save this intermediate learning database, since pyAgrum accepts only files as inputs. As a rule of thumb, save your CSV using commas as separators and do not quote values when you plan to use them with pyAgrum.

In [None]:
import csv
traindf.to_csv(os.path.join('data', 'post_train.csv'), index=False)
testdf.to_csv(os.path.join('data', 'post_test.csv'), index=False)

In [None]:
testdf

## [1] Pre-Learning

We will now learn a Bayesian Network from the training set without any prior knowledge about shipwrecks.

Before learning a Bayesian Network, we first need to create a template. This is not mandatory; however, it is sometimes useful since not all variable values are present in the learning base (in this example, the number of relatives).

If the algorithm encounters an unknown value during the learning step, it will raise an error. This would be an issue if we wanted to automate our classifier, but here we will directly use values that work with both the test and learning bases. This is not ideal, but the objective here is to explore the data fast, not thoroughly.

To help create the template Bayesian Network that we will use to learn our classifier, let us first recall all the variables we have at our disposal.

In [None]:
df = pandas.read_csv(os.path.join('data', 'post_train.csv'))
for k in df.keys():
    print('{0}: {1}'.format(k, len(df[k].unique())))

In [None]:
template=gum.BayesNet()
template.add(gum.LabelizedVariable("Survived", "Survived", ['False', 'True']))
template.add(gum.RangeVariable("Pclass", "Pclass",1,3))
template.add(gum.LabelizedVariable("Gender", "The passenger's gender",['Female', 'Male']))
template.add(gum.LabelizedVariable("Siblings", "Siblings",['False', 'True']))
template.add(gum.LabelizedVariable("Parents", "Parents",['False', 'True']))
template.add(gum.LabelizedVariable("Embarked", "Embarked", ['', 'C', 'Q', 'S']))
template.add(gum.LabelizedVariable("Age", "The passenger's age category", ["baby", "toddler", "kid", "teen", "adult", "old"]))             
gnb.showBN(template)

### Learning from data
We can now learn our first Bayesian Network. As you will see, this is really easy.

In [None]:
file = os.path.join('data', 'post_train.csv')
learner = gum.BNLearner(file, template)
bn = learner.learnBN()
bn

### Exploring the Data

Now that we have a BayesNet, we can start looking at how the variables correlate with each other. pyAgrum offers a useful tool for that: the information graph.



In [None]:
gnb.showInformation(bn,{},size="20")

To read this graph, you must understand what the entropy of a variable means: the higher the value, the more uncertain the variable's marginal probability distribution is (the maximum entropy being reached by the equiprobable law). The lower the value, the more *certain* the law is.

A consequence of how entropy is calculated is that entropy tends to get bigger if the random variable has many modalities.
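
Both properties can be checked numerically. The following sketch computes the Shannon entropy of a few small distributions with the standard library only. The function `entropy` and the choice of bits (base-2 logarithm) as the unit are assumptions of this example; the unit used by pyAgrum's information graph may differ.

```python
import math

def entropy(distribution):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

print(entropy([0.5, 0.5]))     # 1.0, uniform over 2 values: the maximum for a binary variable
print(entropy([0.9, 0.1]))     # below 1.0: the outcome is more predictable
print(entropy([0.25] * 4))     # 2.0, uniform over 4 values: more modalities, higher maximum
```

A variable with many modalities can therefore show a larger entropy than a binary one even when its distribution is far from uniform, which is worth keeping in mind when comparing nodes in the graph.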

What the information graph tells us is that the Age variable has a high entropy. Thus, we can conclude that the passengers' ages are spread across all of its modalities.

What it also tells us is that variables with low entropy, such as Parents or Siblings, are not evenly distributed.

Let us look at the variables' marginal probabilities by using the showInference() function.

In [None]:
gnb.showInference(bn)

The showInference() function is really useful as it shows the marginal probability distribution of each random variable of a BayesNet.

We can now confirm what the entropy taught us: Parents and Siblings are unevenly distributed and Age is more evenly distributed.

Let's focus on the Titanic challenge now, and look at the Survived variable. We show a single posterior using the showPosterior() function.

In [None]:
gnb.showPosterior(bn,evs={},target='Survived')

So about 40% of the passengers in our learning database survived.

So how can we use this BayesNet as a classifier? Given a set of evidence, we can infer an updated posterior distribution of the target variable Survived.

Let's look at the odds of surviving as an adult man.

In [None]:
gnb.showPosterior(bn,evs={"Gender": "Male", "Age": 'adult'},target='Survived')

And now the odds of survival for an old lady:

In [None]:
gnb.showPosterior(bn,evs={"Gender": "Female", "Age": 'old'},target='Survived')

Well, children and ladies first, right?

One last piece of information we need is which variables are required to predict the Survived variable. To find out, we will use the Markov blanket of Survived.

In [None]:
gnb.sideBySide(bn, gum.MarkovBlanket(bn, 'Survived'), captions=["Learned Bayesian Network", "Markov blanket of 'Survived'"])

The Markov blanket of the Survived variable tells us that we only need to observe Gender and Pclass in order to predict Survived. This is not really useful here, but on larger Bayesian Networks it can save you a lot of time and CPU.
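
The Markov blanket of a node consists of its parents, its children, and the other parents of its children. As a sketch of that definition, the plain-Python function below computes it from a list of arcs. Because the learned Titanic structure is not reproduced in the text, the example uses the arcs of the Asia network built earlier. The names `markov_blanket` and `asia_arcs` are hypothetical and this is not pyAgrum's implementation.

```python
def markov_blanket(arcs, target):
    """Parents, children, and the children's other parents of `target` in a DAG."""
    parents = {p for p, c in arcs if c == target}
    children = {c for p, c in arcs if p == target}
    co_parents = {p for p, c in arcs if c in children and p != target}
    return parents | children | co_parents

# Arcs of the Asia network as (parent, child) pairs
asia_arcs = [
    ('VisitAsia', 'Tuberculosis'),
    ('Tuberculosis', 'TubercOrLungCan'),
    ('LungCancer', 'TubercOrLungCan'),
    ('Smoker', 'LungCancer'),
    ('Smoker', 'Bronchitis'),
    ('Bronchitis', 'Dispnea'),
    ('TubercOrLungCan', 'Dispnea'),
    ('TubercOrLungCan', 'PositiveXray'),
]

print(sorted(markov_blanket(asia_arcs, 'TubercOrLungCan')))
print(sorted(markov_blanket(asia_arcs, 'Bronchitis')))
```

Once every node in the blanket is observed, the remaining nodes of the network carry no additional information about the target.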

So how do we use the BayesNet we have learned as a classifier? We simply infer the posterior of the Survived variable given the set of evidence we have, and if the passenger's odds of survival are above some threshold, the passenger is tagged as a survivor.
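
The decision rule itself is simple enough to sketch in plain Python. The snippet below is not the `is_well_predicted` helper used in this notebook; it only illustrates thresholding and how the share of correct predictions changes with the threshold. The functions `classify` and `accuracy` and all the numbers are hypothetical.

```python
def classify(pr_survived, threshold):
    """Tag a passenger as a survivor when the posterior reaches the threshold."""
    return 'True' if pr_survived >= threshold else 'False'

def accuracy(posteriors, outcomes, threshold):
    """Share of passengers whose predicted label equals the recorded outcome."""
    hits = sum(classify(p, threshold) == o for p, o in zip(posteriors, outcomes))
    return hits / len(outcomes)

# Hypothetical posteriors Pr(Survived = True | evidence) and recorded outcomes
posteriors = [0.10, 0.35, 0.62, 0.80, 0.45]
outcomes = ['False', 'False', 'True', 'True', 'False']

for threshold in (0.3, 0.5, 0.7):
    print(threshold, accuracy(posteriors, outcomes, threshold))
```

A threshold that is too low tags too many passengers as survivors, and one that is too high misses real survivors, which is why the threshold is tuned on the training data.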

To compute the best value given the BayesNet and our training database, we can use the showROC() function.

In [None]:
showROC(bn, os.path.join('data', 'post_train.csv'), 'Survived', 'True', True, True)

In [None]:
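# init_belief() and is_well_predicted() are helper functions that are not defined
# in this notebook; they must be defined before running this cell
ie=gum.LazyPropagation(bn)
init_belief(ie)
ie.addTarget('Survived')
result = testdf.apply(lambda x: is_well_predicted(ie, bn, 0.157935, x), axis=1)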
result.value_counts(True)

In [None]:
positives = sum(result.map(lambda x: 1 if x.startswith("True") else 0 ))
total = result.count()
print("{0:.2f}% good predictions".format(positives/total*100))


## [3] Making a BN with a Fixed Structure and Learned Parameters

In this last part we will combine both methods: we will impose the DAG of the BayesNet and learn its parameters from the data. We will assume the naive Bayes hypothesis, which states that all random variables are conditionally independent given the target variable (here the variable Survived).

This results in the following topology.

In [None]:
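bn = gum.BayesNet("Surviving Titanic")
# Reuse the variables declared in the template network above
for name in ['Survived', 'Age', 'Gender', 'Siblings', 'Parents']:
    bn.add(template.variableFromName(name))
# Naive Bayes structure: the target Survived is the only parent of every other node
bn.addArc('Survived', 'Age')
bn.addArc('Survived', 'Gender')
bn.addArc('Survived', 'Siblings')
bn.addArc('Survived', 'Parents')
bn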
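
Under the naive Bayes hypothesis, the posterior of the target is proportional to its prior multiplied by one likelihood term per observed variable. The plain-Python sketch below works through that computation. All probabilities are hypothetical values chosen for illustration (they are not learned from the Titanic data), and the function name `naive_bayes_posterior` is introduced here only.

```python
# Hypothetical probabilities, for illustration only (not learned from the Titanic data)
prior = {'True': 0.4, 'False': 0.6}                          # Pr(Survived)
likelihood = {
    'Gender': {'True': {'Female': 0.7, 'Male': 0.3},         # Pr(Gender | Survived)
               'False': {'Female': 0.2, 'Male': 0.8}},
    'Age': {'True': {'kid': 0.2, 'adult': 0.8},              # Pr(Age | Survived)
            'False': {'kid': 0.1, 'adult': 0.9}},
}

def naive_bayes_posterior(prior, likelihood, evidence):
    """Pr(target | evidence), assuming the features are independent given the target."""
    scores = {}
    for target_state, p in prior.items():
        score = p
        for feature, value in evidence.items():
            score *= likelihood[feature][target_state][value]
        scores[target_state] = score
    total = sum(scores.values())          # normalizing constant Pr(evidence)
    return {state: score / total for state, score in scores.items()}

posterior = naive_bayes_posterior(prior, likelihood, {'Gender': 'Female', 'Age': 'adult'})
print({state: round(p, 4) for state, p in posterior.items()})
```

This is the same computation that the inference engine performs on the star-shaped network above, with the learned CPTs in place of the made-up numbers.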

The next step is to learn the parameters. This can easily be done using the learnParameters method.

In [None]:
learner = gum.BNLearner(os.path.join("data", 'post_train.csv'), bn)
bn = learner.learnParameters(bn.dag())
gnb.showInference(bn, size="10")

If we compare the CPTs obtained here with those defined by our expert in the first example, we can see that they differ. They resemble those obtained in the second example. This result is expected: since we have learned the parameters from the training data, the learned probability distributions should match the data.
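
To see why learned parameters match the data, note that the simplest estimate of a CPT entry is a ratio of counts. The sketch below computes such a maximum-likelihood estimate on a tiny made-up sample with the standard library. The rows and the function `estimate_cpt` are hypothetical, and the values produced by pyAgrum may differ, for example if the learner applies smoothing.

```python
from collections import Counter

# Hypothetical toy rows (Survived, Gender), for illustration only
rows = [('True', 'Female'), ('True', 'Female'), ('True', 'Male'),
        ('False', 'Male'), ('False', 'Male'), ('False', 'Female')]

def estimate_cpt(pairs):
    """Maximum-likelihood estimate of Pr(child | parent) from (parent, child) pairs."""
    pair_counts = Counter(pairs)
    parent_counts = Counter(parent for parent, _ in pairs)
    return {(parent, child): n / parent_counts[parent]
            for (parent, child), n in pair_counts.items()}

cpt = estimate_cpt(rows)
for (survived, gender), p in sorted(cpt.items()):
    print('Pr( Gender = {0} | Survived = {1} ) = {2:.3f}'.format(gender, survived, p))
```

Each row of the estimated table sums to 1 by construction, because the counts for one parent state are divided by their own total.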

The final step consists of confronting this model with our test dataset.

In [None]:
showROC(bn, os.path.join('data', 'post_train.csv'), 'Survived', "True", True, True)

In [None]:
# init_belief() and is_well_predicted(): helper functions, not defined in this notebook
ie = gum.LazyPropagation(bn)
init_belief(ie)
ie.addTarget('Survived')
result = testdf.apply(lambda x: is_well_predicted(ie, bn, 0.35917266477065596, x), axis=1)
result.value_counts(True)

In [None]:
positives = sum(result.map(lambda x: 1 if x.startswith("True") else 0 ))
total = result.count()
print("{0:.2f}% good predictions".format(positives/total*100))

Naive Bayes performs well when used for classification tasks, as shown by the 95% of good predictions achieved by our third model.

### Conclusion

We have demonstrated different classification techniques using Bayesian Networks. In the first approach, we managed to model a classifier without using any training set, relying solely on prior knowledge. In the second approach, we used only machine learning techniques. Finally, in the third approach, we assumed the naive Bayes hypothesis and obtained a model that combines both.

## Try it yourself!

Try to model the following network and come up with some analysis.

Scenario: You have a burglar alarm that is sometimes set off by minor earthquakes. You have two neighbours, John and Mary, who promised to call you if they hear the alarm.

Example of an inference task: suppose Mary calls you, but John does not. What is the probability that a burglary occurred in your house?

<img src = "images/burglar_bn.png" width="500px" >