# Exploration of Collider Bias with discrete Bayes Nets using `pomegranate`

## Collider bias
[Collider bias](https://en.wikipedia.org/wiki/Collider_(statistics)) can be seen as a form of selection bias. In a causal network, if we have the structure `A -> Collider <- B`, then "conditioning on the collider opens the path between A and B [which can] introduce bias when estimating the causal association between A and B, potentially introducing associations where there are none."
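To make the quoted claim concrete, here is a minimal simulation sketch in plain Python (no `pomegranate` needed). The variables `a`, `b` and `collider` are hypothetical and are not part of the covid example: `a` and `b` are independent coin flips, and the collider is switched on whenever either of them is on.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# A and B are independent fair coin flips (hypothetical variables).
a = [rng.random() < 0.5 for _ in range(n)]
b = [rng.random() < 0.5 for _ in range(n)]
# The collider is caused by both: it is on if either A or B is on.
collider = [x or y for x, y in zip(a, b)]

corr_all = pearson(a, b)
# Condition on the collider by keeping only the rows where it is on.
kept = [(x, y) for x, y, c in zip(a, b, collider) if c]
corr_selected = pearson([x for x, _ in kept], [y for _, y in kept])

print(f"correlation, everyone:       {corr_all:+.3f}")
print(f"correlation, collider is on: {corr_selected:+.3f}")
```

Across everyone the correlation should sit close to zero, while among the selected rows it should come out clearly negative (the theoretical value for this toy setup is -0.5), even though `a` and `b` were generated independently.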

## The `pomegranate` package
Here we use the Python package [pomegranate](https://pomegranate.readthedocs.io/en/latest/) (GitHub repo [here](https://github.com/jmschrei/pomegranate)). It allows us to specify a Bayesian Net structure, along with the conditional probability distributions. Amongst other things, we can query the net, conditional upon certain observations. Note that the code below uses the older, pre-1.0 `pomegranate` API (`DiscreteDistribution`, `Node`, `bake`), so it may not run unchanged on recent releases.

In [1]:
import pomegranate as pg

# Example 2: Does collider bias make it look like nicotine is protective against COVID?

![](img/bayes_net_1.png)

Here the prior probability of being a smoker is 25%. **Independently**, the prior probability of having Covid-19 is 10%. The vector `[0.9, 0.1, 0.9, 0.01]` gives the probability of being in hospital for each combination of smoking and covid status, in this order:

- smoking, covid: high probability of being in hospital (0.9)
- smoking, ¬covid: moderate probability of being in hospital (0.1)
- ¬smoking, covid: high probability of being in hospital (0.9)
- ¬smoking, ¬covid: low probability of being in hospital (0.01)

A key point here is that, among people who do not have covid, smokers have a higher probability of being in hospital than non-smokers (0.1 versus 0.01).
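As a cross-check that does not depend on any library, the same priors and hospital table can be written out as plain Python dictionaries and the joint distribution enumerated by hand. This is only a sketch: the names `p_smoke`, `p_covid`, `p_hosp_yes` and `joint` are made up for illustration and are not `pomegranate` objects.

```python
from itertools import product

# Priors and hospital table copied from the description above.
p_smoke = {"yes": 0.25, "no": 0.75}
p_covid = {"yes": 0.10, "no": 0.90}
# P(hospital = yes | smoke, covid), keyed by (smoke, covid).
p_hosp_yes = {
    ("yes", "yes"): 0.90,
    ("yes", "no"): 0.10,
    ("no", "yes"): 0.90,
    ("no", "no"): 0.01,
}

def joint(smoke, covid, hospital):
    """P(smoke, covid, hospital) under the network's factorisation."""
    p_h = p_hosp_yes[(smoke, covid)]
    if hospital == "no":
        p_h = 1.0 - p_h
    return p_smoke[smoke] * p_covid[covid] * p_h

states = ("yes", "no")
# The eight joint probabilities must sum to one.
total = sum(joint(s, c, h) for s, c, h in product(states, repeat=3))
# Marginal probability of being in hospital, summing out smoke and covid.
p_hospital = sum(joint(s, c, "yes") for s, c in product(states, repeat=2))

print(f"sum over all 8 outcomes: {total:.6f}")
print(f"P(hospital = yes):       {p_hospital:.5f}")
```

Writing the table keyed by `(smoke, covid)` makes the ordering of the four-element vector explicit, which is the easiest thing to get wrong when filling in a conditional probability table.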

In [2]:
smokeD = pg.DiscreteDistribution({'yes': 0.25, 'no': 0.75})
covidD = pg.DiscreteDistribution({'yes': 0.1, 'no': 0.9})
# Each row is [smoke, covid, hospital, P(hospital | smoke, covid)].
hospitalD = pg.ConditionalProbabilityTable(
    [['yes', 'yes', 'yes', 0.9], ['yes', 'yes', 'no', 0.1],
     ['yes', 'no', 'yes', 0.1], ['yes', 'no', 'no', 0.9],
     ['no', 'yes', 'yes', 0.9], ['no', 'yes', 'no', 0.1],
     ['no', 'no', 'yes', 0.01], ['no', 'no', 'no', 0.99]],
    [smokeD, covidD])  # parent distributions, in the same order as the columns

smoke = pg.Node(smokeD, name="smokeD")
covid = pg.Node(covidD, name="covidD")
hospital = pg.Node(hospitalD, name="hospitalD")

model = pg.BayesianNetwork("Covid Collider")
model.add_states(smoke, covid, hospital)
model.add_edge(smoke, hospital)
model.add_edge(covid, hospital)
model.bake()

### Sanity check
First we query the probability of having covid given that you are a smoker. It should equal the 10% prior, and it would be the same for a non-smoker, because our Bayes Net specifies that smoking and covid are independent.

In [3]:
beliefs = model.predict_proba({'smokeD': 'yes'})

In [4]:
model.plot()  # optional: needs pygraphviz, otherwise it raises the error below

ValueError: must have pygraphviz installed for visualization

In [5]:
# Observed nodes come back as plain values, unobserved ones as distributions.
beliefs = map(str, beliefs)

print("\n".join("{}\t{}".format(state.name, belief)
                for state, belief in zip(model.states, beliefs)))

smokeD	yes
covidD	{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "yes" :0.10000000000000035,
            "no" :0.8999999999999996
        }
    ],
    "frozen" :false
}
hospitalD	{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "no" :0.8199999999999997,
            "yes" :0.18000000000000022
        }
    ],
    "frozen" :false
}
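The hospital belief for a smoker can also be reproduced by hand with the law of total probability, which is a useful check that the table was entered in the intended order. The sketch below uses plain Python; `p_hosp_given` and `p_hospital_given_smoke` are illustrative names, not part of `pomegranate`.

```python
# Numbers copied from the model definition.
p_covid = 0.10
p_hosp_given = {  # P(hospital = yes | smoke, covid)
    ("yes", "yes"): 0.90, ("yes", "no"): 0.10,
    ("no", "yes"): 0.90, ("no", "no"): 0.01,
}

def p_hospital_given_smoke(smoke):
    """Marginalise covid out: sum over c of P(H | smoke, c) * P(c)."""
    return (p_hosp_given[(smoke, "yes")] * p_covid
            + p_hosp_given[(smoke, "no")] * (1 - p_covid))

print(round(p_hospital_given_smoke("yes"), 4))
print(round(p_hospital_given_smoke("no"), 4))
```

For a smoker this gives 0.9 × 0.1 + 0.1 × 0.9 = 0.18, which agrees with the `hospitalD` belief reported by the model.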


### Calculate P(covid|smoking, hospital) and P(covid|¬smoking, hospital)

If we condition upon being in hospital (i.e. only observe people who are in hospital), then we might see different probabilities of having covid depending on whether someone smokes or not.

In [6]:
model.predict_proba({'smokeD': 'yes', 'hospitalD': 'yes'})

array(['yes',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "yes" :0.5000000000000006,
            "no" :0.4999999999999994
        }
    ],
    "frozen" :false
},
       'yes'], dtype=object)

In [7]:
model.predict_proba({'smokeD': 'no', 'hospitalD': 'yes'})

array(['no',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "yes" :0.9090909090909077,
            "no" :0.09090909090909226
        }
    ],
    "frozen" :false
},
       'yes'], dtype=object)

Confirmed: the two conditional probabilities differ.

Boom! There we have it. There is a _higher_ chance that you will have Covid-19 if you are a _non-smoker_ in hospital (about 91%) than if you are a smoker in hospital (50%). And this result is produced by a Bayesian Net which explicitly states that having covid is statistically independent of being a smoker.
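The two posteriors can also be derived by hand from Bayes' rule, using the fact that covid is independent of smoking in this network. In the sketch below, $S$, $C$ and $H$ are shorthand (introduced here, not used in the code) for smoking, covid and hospital being "yes".

```latex
\begin{aligned}
P(C \mid S, H)
  &= \frac{P(H \mid S, C)\,P(C)}
          {P(H \mid S, C)\,P(C) + P(H \mid S, \neg C)\,P(\neg C)}
   = \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.1 \times 0.9}
   = 0.5 \\[4pt]
P(C \mid \neg S, H)
  &= \frac{P(H \mid \neg S, C)\,P(C)}
          {P(H \mid \neg S, C)\,P(C) + P(H \mid \neg S, \neg C)\,P(\neg C)}
   = \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.01 \times 0.9}
   = \frac{0.09}{0.099} \approx 0.909
\end{aligned}
```

The only term that differs between the two lines is the probability of being in hospital without covid (0.1 for smokers, 0.01 for non-smokers). Smokers have another route into hospital, so finding a smoker there is weaker evidence of covid.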

This is not necessarily proof that nicotine is useless against covid. But it does show that somewhat perplexing epidemiological results should not be taken as sufficient evidence that nicotine is protective against covid.