# Cycling example

Does cycling influence prostate cancer?

Install the packages from the command line with:

    pip install bnlearn
    conda install -c ankurankan pgmpy

In [1]:
import bnlearn
from pgmpy.factors.discrete import TabularCPD

In [2]:
bnlearn.__version__

'0.3.6'

In [3]:
# pgmpy.__version__

## Example where cycling _is protective_ against prostate cancer.

In [4]:
def model_cycling_protective():
    # Define the network structure
    edges = [('cycle', 'prostate'),
             ('cycle', 'in_survey'),
             ('prostate', 'in_survey')]

    # Make the actual Bayesian DAG
    DAG = bnlearn.make_DAG(edges)
    
    cpt_cycle = TabularCPD(variable='cycle',
                           variable_card=2,
                           values=[[0.1], [0.9]],
                           state_names={'cycle': ['yes', 'no']})

    cpt_prostate = TabularCPD(variable='prostate', variable_card=2,
                              values=[[0.05, 0.15],
                                      [0.95, 0.85]],
                              evidence=['cycle'], evidence_card=[2], 
                              state_names={'cycle': ['yes', 'no'], 
                                           'prostate': ['yes', 'no']})

    cpt_in_survey = TabularCPD(variable='in_survey', variable_card=2,
    #                            values=[[0.04, 0.02, 0.001, 0.001],
    #                                    [0.96, 0.98, 0.999, 0.999]],
                               values=[[0.005, 0.02, 0.001, 0.001],
                                       [0.995, 0.98, 0.999, 0.999]],
                               evidence=['cycle', 'prostate'],
                               evidence_card=[2, 2],
                               state_names={'cycle': ['yes', 'no'], 
                                            'prostate': ['yes', 'no'], 
                                            'in_survey': ['yes', 'no']})
    
    DAG = bnlearn.make_DAG(DAG, CPD=[cpt_cycle, cpt_prostate, cpt_in_survey])
    return DAG
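
The four columns of the `in_survey` table are easy to misread. The CPDs printed later in this notebook show the first evidence variable (`cycle`) changing slowest and the last one (`prostate`) changing fastest. A minimal sketch in plain Python, with no bnlearn or pgmpy calls, that labels each column under that ordering (the helper name `label_columns` is hypothetical and not part of either library):

```python
from itertools import product

def label_columns(evidence_states, row):
    """Pair each value in a CPT row with its evidence combination.

    Assumes the first evidence variable varies slowest and the last
    varies fastest, matching the printed in_survey table.
    """
    names = list(evidence_states)
    combos = product(*(evidence_states[n] for n in names))
    return {combo: value for combo, value in zip(combos, row)}

evidence_states = {'cycle': ['yes', 'no'], 'prostate': ['yes', 'no']}
p_in_survey_yes = [0.005, 0.02, 0.001, 0.001]  # first row of cpt_in_survey

for (cycle, prostate), p in label_columns(evidence_states, p_in_survey_yes).items():
    print(f"P(in_survey=yes | cycle={cycle}, prostate={prostate}) = {p}")
```

Read this way, cyclists with prostate cancer respond at 0.5%, cyclists without it at 2%, and non-cyclists at 0.1% regardless of diagnosis.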

In [5]:
def model_cycling_unrelated():
    # Define the network structure
    edges = [('cycle', 'in_survey'),
             ('prostate', 'in_survey')]

    # Make the actual Bayesian DAG
    DAG = bnlearn.make_DAG(edges)
    
    cpt_cycle = TabularCPD(variable='cycle',
                           variable_card=2,
                           values=[[0.1], [0.9]],
                           state_names={'cycle': ['yes', 'no']})

    cpt_prostate = TabularCPD(variable='prostate',
                              variable_card=2,
                              values=[[0.1], [0.9]],
                              state_names={'prostate': ['yes', 'no']})

    cpt_in_survey = TabularCPD(variable='in_survey', variable_card=2,
                               values=[[0.005, 0.02, 0.001, 0.001],
                                       [0.995, 0.98, 0.999, 0.999]],
                               evidence=['cycle', 'prostate'],
                               evidence_card=[2, 2],
                               state_names={'cycle': ['yes', 'no'], 
                                            'prostate': ['yes', 'no'], 
                                            'in_survey': ['yes', 'no']})
    
    DAG = bnlearn.make_DAG(DAG, CPD=[cpt_cycle, cpt_prostate, cpt_in_survey])
    return DAG

In [6]:
def model_cycling_damaging():
    # Define the network structure
    edges = [('cycle', 'prostate'),
             ('cycle', 'in_survey'),
             ('prostate', 'in_survey')]

    # Make the actual Bayesian DAG
    DAG = bnlearn.make_DAG(edges)
    
    cpt_cycle = TabularCPD(variable='cycle',
                           variable_card=2,
                           values=[[0.1], [0.9]],
                           state_names={'cycle': ['yes', 'no']})

    cpt_prostate = TabularCPD(variable='prostate', variable_card=2,
                              values=[[0.15, 0.10],
                                      [0.85, 0.90]],
                              evidence=['cycle'], evidence_card=[2], 
                              state_names={'cycle': ['yes', 'no'], 
                                           'prostate': ['yes', 'no']})

    cpt_in_survey = TabularCPD(variable='in_survey', variable_card=2,
                               values=[[0.005, 0.02, 0.001, 0.001],
                                       [0.995, 0.98, 0.999, 0.999]],
                               evidence=['cycle', 'prostate'],
                               evidence_card=[2, 2],
                               state_names={'cycle': ['yes', 'no'], 
                                            'prostate': ['yes', 'no'], 
                                            'in_survey': ['yes', 'no']})

    DAG = bnlearn.make_DAG(DAG, CPD=[cpt_cycle, cpt_prostate, cpt_in_survey])
    return DAG

Function to query the models

In [7]:
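def make_queries(DAG):
    """Print P(prostate = yes) under three increasingly specific sets of evidence."""
    # Prior: no evidence at all (the general population)
    q1 = bnlearn.inference.fit(DAG, variables=['prostate'], evidence={})
    # Condition on having completed the survey
    q2 = bnlearn.inference.fit(DAG, variables=['prostate'], evidence={'in_survey': 'yes'})
    # Condition on having completed the survey and being a cyclist
    q3 = bnlearn.inference.fit(DAG, variables=['prostate'], evidence={'in_survey': 'yes', 'cycle': 'yes'})

    # values[0] is the 'yes' state, because state_names lists 'yes' first
    print(f"P(prostate) = {q1.values[0]*100:.1f}%")
    print(f"P(prostate|in survey) = {q2.values[0]*100:.1f}%")
    print(f"P(prostate|in survey, cycle) = {q3.values[0]*100:.1f}%")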

Build and check the models

In [8]:
DAG_proctective = model_cycling_protective()
bnlearn.print_CPD(DAG_proctective)

[BNLEARN] Bayesian DAG created.
[BNLEARN] No changes made to existing Bayesian DAG.
[BNLEARN] Add CPD: cycle
[BNLEARN] Add CPD: prostate
[BNLEARN] Add CPD: in_survey
[BNLEARN.print_CPD] Model correct: True
[BNLEARN.print_CPD] Independencies:

[BNLEARN.print_CPD] Nodes: ['cycle', 'prostate', 'in_survey']
[BNLEARN.print_CPD] Edges: [('cycle', 'prostate'), ('cycle', 'in_survey'), ('prostate', 'in_survey')]
CPD of cycle:
+------------+-----+
| cycle(yes) | 0.1 |
+------------+-----+
| cycle(no)  | 0.9 |
+------------+-----+
CPD of prostate:
+---------------+------------+-----------+
| cycle         | cycle(yes) | cycle(no) |
+---------------+------------+-----------+
| prostate(yes) | 0.05       | 0.15      |
+---------------+------------+-----------+
| prostate(no)  | 0.95       | 0.85      |
+---------------+------------+-----------+
CPD of in_survey:
+----------------+---------------+--------------+---------------+--------------+
| cycle          | cycle(yes)    | cycle(yes)   | cycle(n

In [9]:
DAG_unrelated = model_cycling_unrelated()
bnlearn.print_CPD(DAG_unrelated)

[BNLEARN] Bayesian DAG created.
[BNLEARN] No changes made to existing Bayesian DAG.
[BNLEARN] Add CPD: cycle
[BNLEARN] Add CPD: prostate
[BNLEARN] Add CPD: in_survey
[BNLEARN.print_CPD] Model correct: True
[BNLEARN.print_CPD] Independencies:
(cycle _|_ prostate)
(prostate _|_ cycle)
[BNLEARN.print_CPD] Nodes: ['cycle', 'in_survey', 'prostate']
[BNLEARN.print_CPD] Edges: [('cycle', 'in_survey'), ('prostate', 'in_survey')]
CPD of cycle:
+------------+-----+
| cycle(yes) | 0.1 |
+------------+-----+
| cycle(no)  | 0.9 |
+------------+-----+
CPD of prostate:
+---------------+-----+
| prostate(yes) | 0.1 |
+---------------+-----+
| prostate(no)  | 0.9 |
+---------------+-----+
CPD of in_survey:
+----------------+---------------+--------------+---------------+--------------+
| cycle          | cycle(yes)    | cycle(yes)   | cycle(no)     | cycle(no)    |
+----------------+---------------+--------------+---------------+--------------+
| prostate       | prostate(yes) | prostate(no) | prostate

In [10]:
DAG_damaging = model_cycling_damaging()
bnlearn.print_CPD(DAG_damaging)

[BNLEARN] Bayesian DAG created.
[BNLEARN] No changes made to existing Bayesian DAG.
[BNLEARN] Add CPD: cycle
[BNLEARN] Add CPD: prostate
[BNLEARN] Add CPD: in_survey
[BNLEARN.print_CPD] Model correct: True
[BNLEARN.print_CPD] Independencies:

[BNLEARN.print_CPD] Nodes: ['cycle', 'prostate', 'in_survey']
[BNLEARN.print_CPD] Edges: [('cycle', 'prostate'), ('cycle', 'in_survey'), ('prostate', 'in_survey')]
CPD of cycle:
+------------+-----+
| cycle(yes) | 0.1 |
+------------+-----+
| cycle(no)  | 0.9 |
+------------+-----+
CPD of prostate:
+---------------+------------+-----------+
| cycle         | cycle(yes) | cycle(no) |
+---------------+------------+-----------+
| prostate(yes) | 0.15       | 0.1       |
+---------------+------------+-----------+
| prostate(no)  | 0.85       | 0.9       |
+---------------+------------+-----------+
CPD of in_survey:
+----------------+---------------+--------------+---------------+--------------+
| cycle          | cycle(yes)    | cycle(yes)   | cycle(n

In [11]:
make_queries(DAG_proctective)

[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.1400 |
+---------------+-----------------+
| prostate(no)  |          0.8600 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0566 |
+---------------+-----------------+
| prostate(no)  |          0.9434 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0130 |
+---------------+-----------------+
| prostate(no)  |          0.9870 |
+---------------+-----------------+
P(prostate) = 14.0%
P(prostate|in survey) = 5.7%
P(prostate|in survey, cycle) = 1.3%





This model assumes that cycling _does_ decrease the risk of prostate cancer. You observe a lower incidence of prostate cancer in your respondents than in the general population, and an even lower incidence among respondents who cycle. This all makes sense.
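
The three printed probabilities can be checked by hand. A minimal sketch in plain Python that enumerates the joint distribution from the CPT values in `model_cycling_protective` and conditions on the evidence, without using bnlearn (the function name `p_prostate` is hypothetical and exists only for this check):

```python
# CPT values copied from model_cycling_protective()
P_CYCLE = {'yes': 0.1, 'no': 0.9}
P_PROSTATE = {'yes': 0.05, 'no': 0.15}  # P(prostate=yes | cycle)
P_SURVEY = {('yes', 'yes'): 0.005, ('yes', 'no'): 0.02,
            ('no', 'yes'): 0.001, ('no', 'no'): 0.001}  # P(in_survey=yes | cycle, prostate)

def p_prostate(in_survey=None, cycle=None):
    """P(prostate=yes | evidence) by brute-force enumeration of the joint."""
    weight = {'yes': 0.0, 'no': 0.0}
    for c in ('yes', 'no'):
        if cycle is not None and c != cycle:
            continue  # inconsistent with the evidence on cycle
        for p in ('yes', 'no'):
            p_p = P_PROSTATE[c] if p == 'yes' else 1 - P_PROSTATE[c]
            for s in ('yes', 'no'):
                if in_survey is not None and s != in_survey:
                    continue  # inconsistent with the evidence on in_survey
                p_s = P_SURVEY[(c, p)] if s == 'yes' else 1 - P_SURVEY[(c, p)]
                weight[p] += P_CYCLE[c] * p_p * p_s
    # Normalise over the two prostate states
    return weight['yes'] / (weight['yes'] + weight['no'])

print(f"P(prostate) = {p_prostate()*100:.1f}%")
print(f"P(prostate|in survey) = {p_prostate(in_survey='yes')*100:.1f}%")
print(f"P(prostate|in survey, cycle) = {p_prostate(in_survey='yes', cycle='yes')*100:.1f}%")
```

Swapping in the CPT values of the other two models should give the same kind of check for their outputs.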

But the question is: can we observe a lower incidence of prostate cancer in respondents (than in the general population) if cycling and prostate cancer are unrelated, or even if cycling increases the risk of prostate cancer?

In [12]:
make_queries(DAG_unrelated)

[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.1000 |
+---------------+-----------------+
| prostate(no)  |          0.9000 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0509 |
+---------------+-----------------+
| prostate(no)  |          0.9491 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0270 |
+---------------+-----------------+
| prostate(no)  |          0.9730 |
+---------------+-----------------+
P(prostate) = 10.0%
P(prostate|in survey) = 5.1%
P(prostate|in survey, cycle) = 2.7%





In this model, cycling does not influence the risk of prostate cancer, but those with prostate cancer are less likely to complete the survey than those without.
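
The size of that response gap follows directly from the `in_survey` table. A small sketch in plain Python, using the values from `model_cycling_unrelated`, that averages the response probability over cycling status (the function name `response_rate` is hypothetical; the plain weighted average is valid only because `cycle` and `prostate` are independent in this model):

```python
# Values copied from model_cycling_unrelated()
P_CYCLE = {'yes': 0.1, 'no': 0.9}
P_SURVEY = {('yes', 'yes'): 0.005, ('yes', 'no'): 0.02,
            ('no', 'yes'): 0.001, ('no', 'no'): 0.001}  # P(in_survey=yes | cycle, prostate)

def response_rate(prostate):
    """P(in_survey=yes | prostate), marginalising over cycle."""
    return sum(P_CYCLE[c] * P_SURVEY[(c, prostate)] for c in P_CYCLE)

print(f"with prostate cancer:    {response_rate('yes'):.4f}")
print(f"without prostate cancer: {response_rate('no'):.4f}")
```

Under these numbers the whole gap comes from cyclists (0.5% versus 2%); non-cyclists respond at 0.1% either way.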

We see the same pattern of results: a lower incidence of prostate cancer in the survey overall, and an even lower incidence among cyclists in the survey.

This is concerning! But can we still observe a lower incidence of prostate cancer in respondents even if cycling does increase the risk of prostate cancer?

In [13]:
make_queries(DAG_damaging)

[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.1050 |
+---------------+-----------------+
| prostate(no)  |          0.8950 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0617 |
+---------------+-----------------+
| prostate(no)  |          0.9383 |
+---------------+-----------------+
[BNLEARN][inference] Variable Elimination..
+---------------+-----------------+
| prostate      |   phi(prostate) |
| prostate(yes) |          0.0423 |
+---------------+-----------------+
| prostate(no)  |          0.9577 |
+---------------+-----------------+
P(prostate) = 10.5%
P(prostate|in survey) = 6.2%
P(prostate|in survey, cycle) = 4.2%





In this world, cycling _does_ increase the risk of prostate cancer. Nevertheless, _if_ people with prostate cancer are less likely to complete the survey than those without, then you can still observe this apparent protective effect.

This is concerning.
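
One way to see how much the response pattern matters is to vary the response rate of cyclists with prostate cancer while holding everything else in `model_cycling_damaging` fixed. The sketch below applies Bayes' rule directly in plain Python. The function name and the response rates other than 0.005 are illustrative assumptions, not values from this notebook.

```python
def p_prostate_given_survey_and_cycle(p_prostate, resp_with, resp_without):
    """P(prostate=yes | in_survey=yes, cycle=yes) via Bayes' rule."""
    num = p_prostate * resp_with
    return num / (num + (1 - p_prostate) * resp_without)

P_PROSTATE_CYCLIST = 0.15  # P(prostate=yes | cycle=yes) in model_cycling_damaging
RESP_WITHOUT = 0.02        # P(in_survey=yes | cycle=yes, prostate=no)

# 0.005 is the notebook's value; 0.01 and 0.02 are hypothetical alternatives
for resp_with in (0.005, 0.01, 0.02):
    q = p_prostate_given_survey_and_cycle(P_PROSTATE_CYCLIST, resp_with, RESP_WITHOUT)
    print(f"response rate {resp_with:.3f} -> P(prostate|in survey, cycle) = {q*100:.1f}%")
```

When cyclists with and without prostate cancer respond at the same rate, the survey recovers the true 15% risk among cyclists; the lower the response rate of those with cancer, the more protective cycling appears.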

It means that observing lower incidence of prostate cancer in cyclists than in the general population does _not_ tell you anything about the causal relation between cycling and prostate cancer _unless_ you know