In [1]:
import numpy as np
import pylab as pl
from scipy.stats import norm

In [2]:
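# Load the data set: one row per observation, columns as listed in `label`
dat = np.loadtxt('input_ex3.dat')

label=['ID','SS-IN','SED-IN','COND-IN','SS-OUT','SED-OUT','COND-OUT','STAT']
out=['OK','set','sol']   # target levels, coded 1, 2, 3 in the STAT column
nout=3
nfeatures=len(label)-2   # all columns except ID and STAT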

In order to construct a proper Naive Bayes model, we need the prior of the target levels, i.e. $P\Big[t_j\Big]$, and the conditional probability of the features given the target levels, $P\Big[f_i~|~t_j\Big]$. First, let's compute the priors:

In [3]:
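# Prior P[t_j]: relative frequency of each target level in the last column
priors = np.zeros(nout)

for i in range(len(priors)):
    priors[i] = np.sum(dat[:,-1] == i+1)   # number of rows with status i+1

priors /= np.sum(priors)   # counts -> probabilities
print(priors)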

[ 0.30769231  0.38461538  0.30769231]
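
As a cross-check, the same relative frequencies can be obtained in one call with `np.unique(..., return_counts=True)`. The sketch below does not read `input_ex3.dat`. It uses a hypothetical status column `stat` with 4, 5 and 4 samples per level, an assumption chosen only because these are the smallest counts that reproduce the fractions printed above (4/13, 5/13, 4/13).

```python
import numpy as np

# Hypothetical STAT column (assumption): 4 x level 1, 5 x level 2, 4 x level 3.
stat = np.array([1] * 4 + [2] * 5 + [3] * 4)

# Count the occurrences of each level, then normalise to relative frequencies.
levels, counts = np.unique(stat, return_counts=True)
priors_check = counts / counts.sum()

print(levels)
print(priors_check)
```

With the real data, `stat` would simply be `dat[:,-1]`.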


Now, let us compute the conditional pdfs. Following the exercise, we assume that all $P\Big[f_i~|~t_j\Big]$ are normally distributed. We thus have to determine the $N_{features}\times N_{target}$ means and standard deviations that characterize these pdfs:

In [4]:
# mean[j,i] and std[j,i] parametrise the Gaussian P[f_i | t_j]
mean = np.zeros((nout,nfeatures))
std  = np.zeros((nout,nfeatures))

for j in range(nout):
    mask = dat[:,-1]==j+1   # rows whose status is target level j+1
    for i in range(nfeatures):
        # feature i is stored in column i+1 (column 0 is the ID)
        mean[j,i] = np.mean(dat[mask,i+1])
        std[j,i]  = np.std(dat[mask,i+1],ddof=1)   # sample standard deviation
        print('P('+label[i+1]+'|'+out[j]+'  = N(m='+str(mean[j,i])+',sig='+str(std[j,i])+')')

P(SS-IN|OK  = N(m=189.0,sig=45.416590214)
P(SED-IN|OK  = N(m=3.125,sig=0.25)
P(COND-IN|OK  = N(m=1860.5,sig=371.402297606)
P(SS-OUT|OK  = N(m=18.0,sig=6.05530070819)
P(SED-OUT|OK  = N(m=0.054,sig=0.0974029431451)
P(COND-OUT|OK  = N(m=2036.0,sig=532.191068446)
P(SS-IN|set  = N(m=200.8,sig=55.1289397685)
P(SED-IN|set  = N(m=4.4,sig=1.78185296812)
P(COND-IN|set  = N(m=1251.2,sig=116.244139637)
P(SS-OUT|set  = N(m=98.0,sig=23.3773394551)
P(SED-OUT|set  = N(m=1.018,sig=1.52663682649)
P(COND-OUT|set  = N(m=1372.0,sig=142.578048801)
P(SS-IN|sol  = N(m=1301.0,sig=485.440006592)
P(SED-IN|sol  = N(m=32.5,sig=11.9582607431)
P(COND-IN|sol  = N(m=1621.0,sig=453.037893926)
P(SS-OUT|sol  = N(m=49.1,sig=37.7558825439)
P(SED-OUT|sol  = N(m=1293.0,sig=430.950886606)
P(COND-OUT|sol  = N(m=832.85,sig=958.312233391)
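
Each line above defines one Gaussian density. To make explicit what is evaluated when such a density is used, here is a minimal sketch that computes one of them by hand and compares it with `scipy.stats.norm.pdf`. The mean and standard deviation are copied from the printed `P(SS-IN|set` line; the evaluation point `f = 222` is an illustrative SS-IN value chosen for this sketch.

```python
import numpy as np
from scipy.stats import norm

# Parameters copied from the printed line for P(SS-IN|set).
m, sig = 200.8, 55.1289397685
f = 222.0  # illustrative SS-IN value (assumption for this sketch)

# Gaussian density written out explicitly.
pdf_manual = np.exp(-0.5 * ((f - m) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

# Same quantity from scipy: loc is the mean, scale is the standard deviation.
pdf_scipy = norm.pdf(f, loc=m, scale=sig)

print(pdf_manual, pdf_scipy)
```

Note that these are density values, not probabilities, so they carry the inverse units of the feature and can be very small for wide distributions.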


The model is now built and we can use it to make predictions. Our goal is to determine the status of the water treatment plant given a set of observed features. For this, we use the Naive Bayes model. We introduce the "model" of a target level given some features: $M\Big[t_j ~| ~ \mathbf f\Big]=P\Big[t_j\Big]\prod_{i=1}^{N_{features}}P\Big[f_i ~|~t_j\Big]$ (with $\mathbf f$ a set of features). Under the assumption that the features are conditionally independent given the target, $M\Big[t_j ~| ~ \mathbf f\Big]\propto P\Big[t_j ~| ~ \mathbf f\Big]$, where the proportionality constant $1/P\Big[\mathbf f\Big]$ does not depend on the target level. The Naive Bayes prediction is the target level that maximizes $M\Big[t_j ~| ~ \mathbf f\Big]$ (the MAP estimator).

In conclusion, given a set of features $\mathbf f$, we need to compute $M\Big[t_j ~| ~ \mathbf f\Big]$ for the 3 target levels and choose the level for which it is maximal.
In exercise 3, the observed set of features is:
- SS-IN    = 222
- SED-IN   = 4.5
- COND-IN  = 1518
- SS-OUT   = 74
- SED-OUT  = 0.25
- COND-OUT = 1642

In [5]:
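# Observed features, in the same order as label[1:-1]
x = np.array([222,4.5,1518,74,0.25,1642])

# M[t_j | f]: start from the prior and multiply by each Gaussian likelihood
M = priors.copy()

for j in range(nout):
    for i in range(nfeatures):
        M[j] *= norm.pdf(x[i],mean[j,i],std[j,i])

print(M)             # unnormalised values
print(M/np.sum(M))   # normalised so that the three values sum to 1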

[  3.41577454e-36   1.53837179e-13   1.00668352e-21]
[  2.22038297e-23   9.99999993e-01   6.54382460e-09]
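
The products above are extremely small (down to about 1e-36), and with more features they could underflow to zero. A common alternative, sketched here as a suggestion rather than as part of the exercise, is to sum log-densities instead of multiplying densities and to normalise with `scipy.special.logsumexp`. To keep the block self-contained, the priors, means and standard deviations are copied (rounded) from the outputs printed earlier; the names `log_M` and `posterior` are introduced only for this illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Values copied (rounded) from the outputs printed earlier in this notebook.
priors = np.array([0.30769231, 0.38461538, 0.30769231])
mean = np.array([[189.0, 3.125, 1860.5, 18.0, 0.054, 2036.0],
                 [200.8, 4.4, 1251.2, 98.0, 1.018, 1372.0],
                 [1301.0, 32.5, 1621.0, 49.1, 1293.0, 832.85]])
std = np.array([[45.4166, 0.25, 371.4023, 6.0553, 0.097403, 532.1911],
                [55.1289, 1.78185, 116.2441, 23.3773, 1.52664, 142.5780],
                [485.4400, 11.9583, 453.0379, 37.7559, 430.9509, 958.3122]])
x = np.array([222, 4.5, 1518, 74, 0.25, 1642])

# log M[t_j | f] = log P[t_j] + sum_i log P[f_i | t_j]; x broadcasts over the rows.
log_M = np.log(priors) + norm.logpdf(x, loc=mean, scale=std).sum(axis=1)

# Normalise in log space, then exponentiate to get posterior probabilities.
posterior = np.exp(log_M - logsumexp(log_M))

print(posterior)
print(np.argmax(posterior))
```

The index returned by `np.argmax` is 1, which corresponds to `out[1]`, i.e. `'set'`, in agreement with the direct product.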


Conclusion: the most probable status is `set`, with a normalised probability of essentially 1, so there is a problem with the plant's settler equipment!