In [1]:
%load_ext autoreload
%autoreload 2

# Dirichlet prior as database

BNLearner gives access to many priors for parameter and structure learning. One of them is the Dirichlet prior, which needs a prior for every possible parameter in a BN. aGrUM/pyAgrum allows a database to be used as the source of a Dirichlet prior.
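As an illustration of the idea (not of aGrUM's internals), the sketch below treats the counts found in a prior database as Dirichlet pseudo-counts, rescales them to a chosen equivalent sample size, and adds them to the counts of the learning database. The function name `dirichlet_posterior_mean`, the toy rows, and the rescaling rule are assumptions made for this example only.

```python
from collections import Counter


def dirichlet_posterior_mean(data_rows, prior_rows, prior_weight):
    """Estimate P(value) from data counts plus rescaled prior-database counts.

    Hypothetical helper: the prior counts are rescaled so that they sum to
    `prior_weight`, which plays the role of an equivalent sample size.
    """
    values = sorted(set(data_rows) | set(prior_rows))
    data_counts = Counter(data_rows)
    prior_counts = Counter(prior_rows)

    # Rescale the prior counts so that their total equals prior_weight.
    scale = prior_weight / len(prior_rows) if prior_rows else 0.0
    pseudo_counts = {v: prior_counts[v] * scale for v in values}

    total = len(data_rows) + sum(pseudo_counts.values())
    return {v: (data_counts[v] + pseudo_counts[v]) / total for v in values}


# Toy observations of a single binary variable (made up for this sketch).
data_rows = ["yes"] * 8 + ["no"] * 2    # learning database: 8 yes, 2 no
prior_rows = ["yes"] * 1 + ["no"] * 3   # prior database: 1 yes, 3 no

no_prior = dirichlet_posterior_mean(data_rows, prior_rows, prior_weight=0)
with_prior = dirichlet_posterior_mean(data_rows, prior_rows, prior_weight=10)
print(no_prior)
print(with_prior)
```

With a prior weight of 0 the estimate is the plain frequency in the learning database. As the weight grows, the estimate is pulled toward the frequencies of the prior database.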

In [2]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt

import os

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

sizePrior=30000
sizeData=30000

## Generating databases for the Dirichlet prior and for learning

In [3]:
bnPrior = gum.fastBN("A->B;C;D")
bnData = gum.fastBN("A->B->C->D")
bnData.cpt("B").fillWith([0.99,0.01,
                          0.01,0.99])
bnData.cpt("C").fillWith([0.99,0.01,
                          0.01,0.99])
bnData.cpt("D").fillWith([0.99,0.01,
                          0.01,0.99])
bnPrior.cpt("B").fillWith(bnData.cpt("B"))

gum.generateCSV(bnPrior, "dirichlet.csv", sizePrior, with_labels=True,random_order=True)

gum.generateCSV(bnData, "database.csv", sizeData, with_labels=True,random_order=False)

gnb.sideBySide(bnData,bnPrior,
               captions=[f"Database ({sizeData} cases)",f"Prior ({sizePrior} cases)"])

Output (two graphs side by side):
Database (30000 cases): A->B, B->C, C->D
Prior (30000 cases): A->B; C and D isolated


## Learning databases

In [None]:
# here the variables and their domains are read directly from the CSV files
learnerData = gum.BNLearner("database.csv") 
learnerPrior = gum.BNLearner("dirichlet.csv") 
learnerData.useScoreBIC()
learnerPrior.useScoreBIC()
gnb.sideBySide(learnerData.learnBN(),learnerPrior.learnBN(),
              captions=["Learning from Data","Learning from Prior"])

Output (two graphs side by side):
Learning from Data: B->A, C->B, D->C
Learning from Prior: B->A; C and D isolated


## Learning with Dirichlet prior

Now we use the Dirichlet prior. To get an idea of the influence of the prior, we sweep a ratio from 0 (prior weight 0, so only the database counts) to 1 (database weight 0, so only the prior counts).

In [None]:
def learnWithRatio(ratio):
    # bnPrior is used to give the variables and their domains
    learner = gum.BNLearner("database.csv", bnPrior) 
    learner.useAprioriDirichlet("dirichlet.csv")
    learner.setAprioriWeight(ratio*sizePrior)
    learner.setDatabaseWeight((1-ratio)) #*sizeData)
    learner.useScoreBIC() # or another score with no included prior
    return learner.learnBN()

ratios=[0.0,0.01,0.05,0.2,0.5,0.8,0.9,0.95,0.99,1.0]
bns=[learnWithRatio(r) for r in ratios]
gnb.sideBySide(*bns,
              captions=[f"with ratio {r}<br/> [datasize : {r*sizePrior+(1-r)*sizeData}]" for r in ratios])


Output (ten graphs side by side, each captioned "with ratio r [datasize : 30000.0]"):
ratio 0.0: B->A, C->B, D->C
ratio 0.01: A->B, B->C, B->D, D->C
ratio 0.05: A->B, B->D, C->A, C->B, C->D
ratio 0.2: B->A, B->C, D->B, D->C
ratio 0.5: B->A, B->D, C->B, C->D
ratio 0.8: A->B, B->C, B->D, C->D
ratio 0.9: B->A, B->C, D->B, D->C
ratio 0.95: A->B, A->C, B->D, D->C
ratio 0.99: B->A; C and D isolated
ratio 1.0: B->A; C and D isolated


The BNs learned when mixing the two data sources (with $ratio \in [0.01,0.99]$) look much more complex than the structures learned from the data alone or from the Dirichlet prior alone. This may seem odd. However, it helps to look at the mutual information carried by each arc:

In [None]:
infs=[gnb.getInformation(bn) for bn in bns]

In [None]:
gnb.sideBySide(*infs,
              captions=[f"with ratio {r}<br/> [datasize : {r*sizePrior+(1-r)*sizeData}]" for r in ratios],
              valign="bottom")

These extra arcs represent weak, spurious correlations caused by mixing probabilities (see Wellman and Peacock (99)), and they become weaker as the weight of the prior increases.
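The effect of mixing can be checked by hand on a reduced example. The sketch below is a pure-Python illustration over B, C and D only: it mixes a chain distribution B->C->D (with the 0.99/0.01 tables used above) with a fully independent distribution and computes the conditional mutual information $I(B;D \mid C)$. The uniform marginals are an assumption made here for simplicity (the notebook's `fastBN` draws them at random), and all function names are hypothetical.

```python
import itertools
import math

STATES = (0, 1)


def chain_joint(b, c, d):
    """P(b, c, d) for the chain B->C->D; P(B) is assumed uniform."""
    def copy(child, parent):
        # The child copies its parent with probability 0.99.
        return 0.99 if child == parent else 0.01
    return 0.5 * copy(c, b) * copy(d, c)


def independent_joint(b, c, d):
    """P(b, c, d) when B, C and D are independent and uniform (assumption)."""
    return 0.125


def mixture(w):
    """Joint table of (1 - w) * chain + w * independent."""
    return {
        (b, c, d): (1 - w) * chain_joint(b, c, d) + w * independent_joint(b, c, d)
        for b, c, d in itertools.product(STATES, repeat=3)
    }


def conditional_mutual_information(joint):
    """I(B;D|C) = sum p(b,c,d) * log(p(b,c,d) p(c) / (p(b,c) p(c,d)))."""
    p_c = {c: sum(p for (_, c2, _), p in joint.items() if c2 == c) for c in STATES}
    p_bc = {(b, c): sum(joint[(b, c, d)] for d in STATES)
            for b in STATES for c in STATES}
    p_cd = {(c, d): sum(joint[(b, c, d)] for b in STATES)
            for c in STATES for d in STATES}
    total = 0.0
    for (b, c, d), p in joint.items():
        if p > 0.0:
            total += p * math.log(p * p_c[c] / (p_bc[(b, c)] * p_cd[(c, d)]))
    return total


for w in (0.0, 0.5, 1.0):
    print(w, conditional_mutual_information(mixture(w)))
```

In each pure component, B and D are independent given C, so the conditional mutual information vanishes (up to rounding). In the mixture it is strictly positive, which is the kind of dependence a score-based learner may encode as an extra arc such as B->D.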