* Causal strength
* Confounding
* Interventions
* Counterfactuals


https://ei.is.tuebingen.mpg.de/research_projects/causal-inference

First, let us consider a simple example.

**Good/bad/ugly**

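We have 3 events: a child being good (P(g=1)=0.9), a child being bad (P(b=1)=0.3), and a child being given a lolly. A child being given a lolly is caused by the child being good (P(l=1|g=1) = 1), and a child who is not good never gets one (P(l=1|g=0) = 0). In the code below the lolly variable is named `attention`.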

In [6]:
import numpy as np
n = 1000
#Generate some 'observed' data
good = (np.random.random(n) < 0.9).astype(int)  # P(g=1) = 0.9
bad = (np.random.random(n) < 0.3).astype(int)   # P(b=1) = 0.3, drawn independently of good
attention = good.copy()                         # lolly (l): given exactly when the child is good

* Marginal probabilities are estimated as the number of times an event occurred divided by the number of times it could have occurred.
* Joint probabilities can be calculated by multiplying the marginal probabilities, P(A,B) = P(A) * P(B), but only if A and B are independent. Assuming independence where it does not hold will lead to problems.
* Conditional probabilities are calculated by dividing the joint probability of the two events by the marginal probability of whichever event is given: P(A|B) = P(A,B)/P(B)
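As a minimal sketch of the three definitions above, the snippet below estimates each quantity directly from two binary arrays. The arrays `a` and `b`, the seed and the helper names are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_samples = 10_000

# Two hypothetical binary events; b is a noisy copy of a, so they are dependent
a = (rng.random(n_samples) < 0.6).astype(int)
flip = rng.random(n_samples) < 0.1
b = np.where(flip, 1 - a, a)

def marginal(x):
    """P(x=1): fraction of samples in which the event occurred."""
    return x.mean()

def joint(x, y):
    """P(x=1, y=1): fraction of samples in which both events occurred."""
    return (x * y).mean()

def conditional(x, y):
    """P(x=1 | y=1) = P(x=1, y=1) / P(y=1)."""
    return joint(x, y) / marginal(y)

p_joint_counted = joint(a, b)
p_joint_product = marginal(a) * marginal(b)  # only valid if a and b are independent

print("counted joint:       ", p_joint_counted)
print("product of marginals:", p_joint_product)
print("P(a=1|b=1):          ", conditional(a, b))
```

Because `b` depends on `a` here, the counted joint and the product of marginals should disagree noticeably, which is the problem the independence assumption causes.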

In [7]:
import tabulate as tab
#Calculating priors from observed data
#Marginal probabilities
tags = ['g=1','b=1','l=1','g=0','b=0','l=0']
mP = np.array([np.sum(good)/len(good),np.sum(bad)/len(bad),np.sum(attention)/len(attention)])
mP = np.concatenate((mP,1-mP),axis = 0)
MP = list(zip(tags,list(mP)))
print('Marginal probabilities - (priors)')
print(tab.tabulate(MP))

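#Joint probabilities (naively assuming all events are independent)
#Note: this also pairs states of the same variable, e.g. P(g=1,g=0)
jP = []
m = len(MP)  # new name so the sample size n is not overwritten
for i in range(m):
    for j in range(i+1, m):
        jP.append(['P(' + MP[i][0] + ',' + MP[j][0] + ')',
                   MP[i][1]*MP[j][1]])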
for i in range(0,2):
    for j in range(2,4):
        for k in range(4,6):
            jP.append([  'P(' + MP[i][0] + ',' + MP[j][0] + ',' + MP[k][0]+ ')'
           , MP[i][1]*MP[j][1]*MP[k][1] ])
print('Joint probabilities')
print(tab.tabulate(jP))

Marginal probabilities - (priors)
---  -----
g=1  0.892
b=1  0.317
l=1  0.892
g=0  0.108
b=0  0.683
l=0  0.108
---  -----
Joint probabilities
--------------  ----------
P(g=1,b=1)      0.282764
P(g=1,l=1)      0.795664
P(g=1,g=0)      0.096336
P(g=1,b=0)      0.609236
P(g=1,l=0)      0.096336
P(b=1,l=1)      0.282764
P(b=1,g=0)      0.034236
P(b=1,b=0)      0.216511
P(b=1,l=0)      0.034236
P(l=1,g=0)      0.096336
P(l=1,b=0)      0.609236
P(l=1,l=0)      0.096336
P(g=0,b=0)      0.073764
P(g=0,l=0)      0.011664
P(b=0,l=0)      0.073764
P(g=1,l=1,b=0)  0.543439
P(g=1,l=1,l=0)  0.0859317
P(g=1,g=0,b=0)  0.0657975
P(g=1,g=0,l=0)  0.0104043
P(b=1,l=1,b=0)  0.193128
P(b=1,l=1,l=0)  0.0305385
P(b=1,g=0,b=0)  0.0233832
P(b=1,g=0,l=0)  0.00369749
--------------  ----------


Calculating the conditional probabilities with the equation P(A|B) = P(A,B)/P(B) is pointless here: because each joint was computed as P(A) * P(B), the equation simply returns P(A).
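To spell out that step, substituting the independence-based joint into the definition of the conditional cancels P(B):

```latex
P(A \mid B) = \frac{P(A,B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A)
```

For example, with the values printed above, P(g=1|b=1) = 0.282764 / 0.317 = 0.892, which is just P(g=1).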

* Why do these probabilities make little sense? For example, we know that P(l=1,g=1) = P(g=1) x P(l=1|g=1) = 0.892 x 1 = 0.892, not the 0.795664 computed above.
* And why can we not figure out the conditional probabilities?

The problem is that our initial system doesn't use any information about the causal structure of these variables. So, how can we figure out that attention depends on good/bad?
  
Firstly, we have ignored information about how events are correlated. We can calculate the conditional probabilities from how often events co-occur, and then use them to calculate the joint probabilities.


So how can we infer causality from these joint and conditional probabilities? What patterns are there?

In [8]:
#Recalculating using correlations
def corr(a, b):
    """Conditional probability P(a=1|b=1), estimated from co-occurrence counts."""
    return np.sum(a*b)/np.sum(b)

data = [good,bad,attention]
cP = []
jP = []
for i in range(3):
    for j in range(3):
        if i != j:
            #Conditional probabilities
            p = corr(data[i],data[j])
            cP.append([ 'P(' + tags[i] +'|' + tags[j] + ')', p ])
            #The complement flips the event, not the condition
            cP.append([ 'P(' + tags[i+3] +'|' + tags[j] + ')', 1-p ])

            #Joint probabilities: P(A,B) = P(A|B) * P(B)
            jP.append([ 'P(' + tags[i] +',' + tags[j] + ')', p*MP[j][1] ])
            jP.append([ 'P(' + tags[i+3] +',' + tags[j] + ')', (1-p)*MP[j][1] ])
#TODO: conditionals involving three variables, e.g. P(a|b,c) and P(a,b|c)
print('Conditional probabilities calculated with correlations')
print(tab.tabulate(cP))
print('Joint probabilities calculated from conditionals')
print(tab.tabulate(jP))

Conditional probabilities calculated with correlations
----------  ---------
P(g=1|b=1)  0.902208
P(g=0|b=1)  0.0977918
P(g=1|l=1)  1
P(g=0|l=1)  0
P(b=1|g=1)  0.320628
P(b=0|g=1)  0.679372
P(b=1|l=1)  0.320628
P(b=0|l=1)  0.679372
P(l=1|g=1)  1
P(l=0|g=1)  0
P(l=1|b=1)  0.902208
P(l=0|b=1)  0.0977918
----------  ---------
Joint probabilities calculated from conditionals
----------  -----
P(g=1,b=1)  0.286
P(g=0,b=1)  0.031
P(g=1,l=1)  0.892
P(g=0,l=1)  0
P(b=1,g=1)  0.286
P(b=0,g=1)  0.606
P(b=1,l=1)  0.286
P(b=0,l=1)  0.606
P(l=1,g=1)  0.892
P(l=0,g=1)  0
P(l=1,b=1)  0.286
P(l=0,b=1)  0.031
----------  -----


Now we see that P(l=1|g=1) = 1 (and P(g=1|l=1) = 1), which tells us that these two variables are perfectly correlated. How do we infer causality?

So if some event, which we will call A, happens most of the time (>70%), then A will co-occur with most other events and P(A|B) will be high for almost any B. For example, P(g=1|b=1) is about 0.9 above even though good and bad were generated independently. So, how do we distinguish whether A is causing other events or is just an independent variable that always seems to be present?
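One hedged way to tell a merely frequent event from a dependent one is to compare P(A|B) with P(A|not B): if the two are about equal, observing B tells us nothing about A, however large P(A|B) is. The sketch below regenerates data with the same probabilities as the first cell; the seed, the sample size and the helper `dependence_gap` are assumptions. A non-zero gap indicates statistical dependence only and says nothing about the direction of causation.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
n_samples = 20_000

good = (rng.random(n_samples) < 0.9).astype(int)
bad = (rng.random(n_samples) < 0.3).astype(int)   # independent of good
lolly = good.copy()                               # lolly given exactly when good

def dependence_gap(x, y):
    """P(x=1 | y=1) - P(x=1 | y=0); close to 0 when x and y are independent."""
    return x[y == 1].mean() - x[y == 0].mean()

gap_good_bad = dependence_gap(good, bad)      # frequent but independent events
gap_lolly_good = dependence_gap(lolly, good)  # fully dependent events

print("P(g=1|b=1) - P(g=1|b=0):", gap_good_bad)
print("P(l=1|g=1) - P(l=1|g=0):", gap_lolly_good)
```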

Well, if other events occur even once without A, then A cannot be the cause. But then you need to consider noise and the chance that you are not observing the true state of the events... So that doesn't really help us, unless we can integrate a probabilistic model of the expected amount of noise.
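A sketch of what observation noise does to the "one counterexample" rule, assuming each observed lolly value is flipped independently with a small probability `eps`. This noise model is hypothetical; the notebook does not specify one. Under that assumption the number of apparent counterexamples among good children is binomial with rate `eps`, so some counterexamples are expected even when being good always leads to a lolly.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
n_samples = 10_000
eps = 0.02  # assumed probability that an observation is recorded wrongly

good = (rng.random(n_samples) < 0.9).astype(int)
lolly_true = good.copy()                      # true mechanism: lolly iff good
flipped = rng.random(n_samples) < eps
lolly_obs = np.where(flipped, 1 - lolly_true, lolly_true)

# Apparent counterexamples to "good implies lolly" in the observed data
n_good = int(good.sum())
n_counter = int(np.sum((good == 1) & (lolly_obs == 0)))

# Under the assumed noise model the count is Binomial(n_good, eps)
expected = n_good * eps
std = np.sqrt(n_good * eps * (1 - eps))

print("observed counterexamples:", n_counter)
print("expected under noise:    ", expected, "+/-", std)
```

If the observed count sits far above what the noise model predicts, that would be evidence against the causal rule; a count within a few standard deviations would not be.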

Now consider this for non-binary variables. This will require distributions of probability to 

In [None]:
good = np.random.randn(n)
bad = np.random.randn(n)
attention = np.greater(0.5,good+bad) #True where good+bad < 0.5; should this be binary?
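For continuous causes, one way to look at the dependence is to estimate P(attention | good) within bins of `good`. The sketch below uses the same generative rule as the cell above; the seed, the sample size and the bin edges are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen only for reproducibility
n_samples = 50_000

good = rng.standard_normal(n_samples)
bad = rng.standard_normal(n_samples)
attention = (good + bad) < 0.5  # same rule as np.greater(0.5, good + bad)

edges = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])  # arbitrary bin edges for `good`
bin_index = np.digitize(good, edges)           # 0 and len(edges) mark the outer tails

# Empirical P(attention=True | good falls in bin k) for the four inner bins
p_attention_given_bin = np.array([
    attention[bin_index == k].mean() for k in range(1, len(edges))
])

for k in range(1, len(edges)):
    print(f"good in [{edges[k-1]}, {edges[k]}): "
          f"P(attention) = {p_attention_given_bin[k-1]:.3f}")
```

With this rule the estimated probability falls as `good` increases, so the conditional becomes a curve over `good` rather than a single number.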

Questions
* How should these probabilities be represented in a network? How can a neural network capture explaining away?
* How should a neural network represent marginal, joint and conditional probabilities? Or should it calculate them?

Generating a model of priors from observed data with a neural net. How? Local vs distributed representation? Learning... Embedding high-dimensional vectors???





Questions and notes
* So in calculating the marginal probabilities we have assumed that variables cannot be in a superposition of states.
* What if we don't have the ability to recognise, for certain, when and which events are occurring?
* How should a new piece of information affect the system?