Problem (Re)Statement:

* Shortness of breath (dyspnea) may be due to tuberculosis, lung cancer or bronchitis, or none of them, or more than one of them. 
* A recent visit to Asia increases the chances of tuberculosis.
* Smoking is known to be a risk factor for both lung cancer and bronchitis. 
* A positive chest X-ray suggests either lung cancer or tuberculosis, but cannot distinguish between them

Here is a data set from which to estimate your model parameters:

In [2]:
import pandas as pd

# Load the Asia data set; each column is a 0/1 indicator for one variable.
df = pd.read_csv("asia.csv")

In [3]:
df.head()

Unnamed: 0,Smoker,LungCancer,VisitToAsia,Tuberculosis,TuberculosisOrCancer,XRay,Bronchitis,Dyspnea
0,1,1,0,0,1,1,1,1
1,0,0,0,0,0,1,1,1
2,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,1,1
4,1,0,0,0,0,0,1,0


<image src="asia.gif" size=200/>



Begin by writing out your model. For example, here are the names of some nodes and the arcs that connect them.
<pre>
Asia                 -> Tuberculosis

Smoker               -> LungCancer

Smoker               -> Bronchitis

Tuberculosis, Cancer -> TuberculosisOrCancer

TuberculosisOrCancer -> XRay

TuberculosisOrCancer, Bronchitis  -> Dyspnea
</pre>
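The arcs above can also be written down as a parent map, which makes it easy to see which probabilities have to be estimated. The following is a minimal sketch in plain Python, not part of the pomegranate model; the helper names `topological_order`, `factorization` and `parameter_count` are hypothetical, and the node names are assumed to follow the column names of the data set.

```python
# Parents of each node, transcribed from the arcs listed above.
# Node names follow the data set's columns (an assumption for illustration).
PARENTS = {
    "Asia": [],
    "Smoker": [],
    "Tuberculosis": ["Asia"],
    "LungCancer": ["Smoker"],
    "Bronchitis": ["Smoker"],
    "TuberculosisOrCancer": ["Tuberculosis", "LungCancer"],
    "XRay": ["TuberculosisOrCancer"],
    "Dyspnea": ["TuberculosisOrCancer", "Bronchitis"],
}


def topological_order(parents):
    """Return the nodes so that every node appears after all of its parents."""
    order, placed = [], set()
    remaining = dict(parents)
    while remaining:
        ready = [n for n, ps in remaining.items() if all(p in placed for p in ps)]
        if not ready:
            raise ValueError("the graph contains a cycle")
        for node in ready:
            order.append(node)
            placed.add(node)
            del remaining[node]
    return order


def factorization(parents):
    """Write the joint distribution as a product of one term per node."""
    terms = []
    for node in topological_order(parents):
        ps = parents[node]
        terms.append(f"P({node} | {', '.join(ps)})" if ps else f"P({node})")
    return " * ".join(terms)


def parameter_count(parents):
    """For binary nodes, each node needs one probability per parent configuration."""
    return {node: 2 ** len(ps) for node, ps in parents.items()}


print(factorization(PARENTS))
print(parameter_count(PARENTS))
```

Each term of the printed product corresponds to one distribution or conditional probability table to be filled in from the data.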

-- Informally Write Your Model In This Cell -- 
It will determine the parameters you will need to get from the data set

Now define your distributions

In [4]:
from pomegranate import *
# Helper function: estimate a conditional probability by counting rows.
# arg1 and the conditionArg* arguments are DataFrame columns (Series);
# a1 and c1..c3 are the values those columns are compared against.
def conditionalProb(df, arg1, a1, conditionArg1, c1, conditionArg2=None, c2=None, conditionArg3=None, c3=None):
    if (conditionArg2 is None) and (conditionArg3 is None): 
        # P(A|B) = P(A,B) / P(B)
        return df[(conditionArg1 == c1) & (arg1 == a1)].shape[0]/df[(conditionArg1 == c1)].shape[0]    
    if conditionArg3 is None:
        # P(A|B,C) = P(A,B,C) / P(B,C)
        return df[(conditionArg1 == c1) & (conditionArg2 == c2) & (arg1 == a1)].shape[0]/ \
                              df[(conditionArg1 == c1) & (conditionArg2 == c2)].shape[0]
    else:
        # P(A|B,C,D) = P(A,B,C,D) / P(B,C,D)
        return df[(conditionArg1 == c1) & (conditionArg2 == c2) & (conditionArg3 == c3) & (arg1 == a1)].shape[0]/ \
                              df[(conditionArg1 == c1) & (conditionArg2 == c2) & (conditionArg3 == c3)].shape[0]

# Asia
asiadist =  DiscreteDistribution({"NotVisitAsia": (1-.0094), "VisitAsia": .0094})

# Smoker
pSmoker = df.Smoker.value_counts()[1]/df.shape[0]
smokerdist =  DiscreteDistribution({"NotSmoker": (1-pSmoker), "Smoker": pSmoker})

# Tuberculosis|Asia
pTuberculosisGivenAsia    = conditionalProb(df, df.Tuberculosis, 1, df.VisitToAsia, 1)
pTuberculosisGivenNotAsia = conditionalProb(df, df.Tuberculosis, 1, df.VisitToAsia, 0)
tuberculosisdist = ConditionalProbabilityTable([["VisitAsia","Tuberculosis", pTuberculosisGivenAsia],
                                                ["VisitAsia","NotTuberculosis", 1-pTuberculosisGivenAsia],
                                                ["NotVisitAsia","Tuberculosis", pTuberculosisGivenNotAsia], 
                                                ["NotVisitAsia","NotTuberculosis", 1-pTuberculosisGivenNotAsia]],
                                                [asiadist])


# LungCancer|Smoker
pCancerGivenSmoker    = conditionalProb(df, df.LungCancer, 1, df.Smoker, 1)
pCancerGivenNotSmoker = conditionalProb(df, df.LungCancer, 1, df.Smoker, 0)
cancerdist = ConditionalProbabilityTable([["Smoker","Cancer", pCancerGivenSmoker],    
                                          ["Smoker", "NotCancer", 1-pCancerGivenSmoker],
                                          ["NotSmoker","Cancer", pCancerGivenNotSmoker], 
                                          ["NotSmoker","NotCancer", 1-pCancerGivenNotSmoker]],
                                          [smokerdist])

# Bronchitis|Smoker
pBronGivenSmoker    = conditionalProb(df, df.Bronchitis, 1, df.Smoker, 1)
pBronGivenNotSmoker = conditionalProb(df, df.Bronchitis, 1, df.Smoker, 0)
bronchitisdist = ConditionalProbabilityTable([["Smoker","Bronchitis", pBronGivenSmoker],    
                                              ["Smoker","NotBronchitis", 1-pBronGivenSmoker],
                                              ["NotSmoker","Bronchitis", pBronGivenNotSmoker], 
                                              ["NotSmoker","NotBronchitis", 1-pBronGivenNotSmoker]],
                                              [smokerdist])

# TuberculosisOrCancer|Tuberculosis, Cancer
pTOrCGivenTC    = conditionalProb(df, df.TuberculosisOrCancer, 1, df.Tuberculosis, 1, df.LungCancer, 1)
pTOrCGivenTNotC = conditionalProb(df, df.TuberculosisOrCancer, 1, df.Tuberculosis, 1, df.LungCancer, 0)
pTOrCGivenCNotT = conditionalProb(df, df.TuberculosisOrCancer, 1, df.Tuberculosis, 0, df.LungCancer, 1)
pTOrCGivenNotTC = conditionalProb(df, df.TuberculosisOrCancer, 1, df.Tuberculosis, 0, df.LungCancer, 0)
tOrCdist = ConditionalProbabilityTable([["Tuberculosis","Cancer","TuberculosisOrCancer", pTOrCGivenTC],      
                                        ["Tuberculosis","Cancer","NotTuberculosisOrCancer", 1-pTOrCGivenTC],
                                        ["Tuberculosis","NotCancer","TuberculosisOrCancer", pTOrCGivenTNotC],   
                                        ["Tuberculosis","NotCancer","NotTuberculosisOrCancer", 1-pTOrCGivenTNotC],
                                        ["NotTuberculosis","Cancer","TuberculosisOrCancer", pTOrCGivenCNotT],   
                                        ["NotTuberculosis","Cancer","NotTuberculosisOrCancer", 1-pTOrCGivenCNotT],
                                        ["NotTuberculosis","NotCancer","TuberculosisOrCancer", pTOrCGivenNotTC],
                                        ["NotTuberculosis","NotCancer","NotTuberculosisOrCancer", 1-pTOrCGivenNotTC]],
                                        [tuberculosisdist, cancerdist])

# XRay|TuberculosisOrCancer
pXRayGivenTC    = conditionalProb(df, df.XRay, 1, df.TuberculosisOrCancer, 1)
pXRayGivenNotTC = conditionalProb(df, df.XRay, 1, df.TuberculosisOrCancer, 0)
xraydist = ConditionalProbabilityTable([["TuberculosisOrCancer","XRay", pXRayGivenTC],      
                                        ["TuberculosisOrCancer","NotXRay", 1-pXRayGivenTC],
                                        ["NotTuberculosisOrCancer","XRay", pXRayGivenNotTC],   
                                        ["NotTuberculosisOrCancer","NotXRay", 1-pXRayGivenNotTC]],
                                        [tOrCdist])


# Dyspnea|TuberculosisOrCancer,Bronchitis
pDyspneaGivenTCB    = conditionalProb(df, df.Dyspnea, 1, df.TuberculosisOrCancer, 1, df.Bronchitis, 1)
pDyspneaGivenTCNotB = conditionalProb(df, df.Dyspnea, 1, df.TuberculosisOrCancer, 1, df.Bronchitis, 0)
pDyspneaGivenBNotTC = conditionalProb(df, df.Dyspnea, 1, df.TuberculosisOrCancer, 0, df.Bronchitis, 1)
pDyspneaGivenNotTCB = conditionalProb(df, df.Dyspnea, 1, df.TuberculosisOrCancer, 0, df.Bronchitis, 0)
dyspneadist = ConditionalProbabilityTable([["TuberculosisOrCancer","Bronchitis","Dyspnea", pDyspneaGivenTCB],      
                                        ["TuberculosisOrCancer","Bronchitis","NotDyspnea", 1-pDyspneaGivenTCB],
                                        ["TuberculosisOrCancer","NotBronchitis","Dyspnea", pDyspneaGivenTCNotB],   
                                        ["TuberculosisOrCancer","NotBronchitis","NotDyspnea", 1-pDyspneaGivenTCNotB],
                                        ["NotTuberculosisOrCancer","Bronchitis","Dyspnea", pDyspneaGivenBNotTC],   
                                        ["NotTuberculosisOrCancer","Bronchitis","NotDyspnea", 1-pDyspneaGivenBNotTC],
                                        ["NotTuberculosisOrCancer","NotBronchitis","Dyspnea", pDyspneaGivenNotTCB],
                                        ["NotTuberculosisOrCancer","NotBronchitis","NotDyspnea", 1-pDyspneaGivenNotTCB]],
                                        [tOrCdist, bronchitisdist])
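The counting estimate used for every table in the cell above can be checked by hand on a small sample. The sketch below is a plain-Python illustration, not the notebook's helper: it uses only the five rows printed by `df.head()`, and the function name `conditional_prob` and its dictionary-based `given` argument are hypothetical. It also returns `None` when no row matches the condition, a case in which a plain ratio of counts would divide by zero.

```python
# First five rows of the data set, as printed by df.head() above
# (the unnamed index column is dropped).
COLUMNS = ["Smoker", "LungCancer", "VisitToAsia", "Tuberculosis",
           "TuberculosisOrCancer", "XRay", "Bronchitis", "Dyspnea"]
RAW = [
    (1, 1, 0, 0, 1, 1, 1, 1),
    (0, 0, 0, 0, 0, 1, 1, 1),
    (0, 0, 0, 0, 0, 0, 1, 1),
    (0, 0, 0, 0, 0, 0, 1, 1),
    (1, 0, 0, 0, 0, 0, 1, 0),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]


def conditional_prob(rows, target, value, given):
    """Estimate P(target == value | given) as a ratio of row counts.

    `given` maps column names to the values they are conditioned on.
    Returns None when no row satisfies the condition.
    """
    matching = [r for r in rows if all(r[k] == v for k, v in given.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching if r[target] == value)
    return hits / len(matching)


# Two of the five sampled people smoke, and one of them has lung cancer.
print(conditional_prob(ROWS, "LungCancer", 1, {"Smoker": 1}))
# Nobody in the sample visited Asia, so this estimate is undefined.
print(conditional_prob(ROWS, "Tuberculosis", 1, {"VisitToAsia": 1}))
```

Five rows are far too few for a usable estimate; the point is only to make the ratio of counts concrete.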

Next define the nodes in your network

In [5]:
asia                 = Node(asiadist,         name="asia")
smoker               = Node(smokerdist,       name="smoker")
tuberculosis         = Node(tuberculosisdist, name="tuberculosis")
lungCancer           = Node(cancerdist,       name="lungCancer")
bronchitis           = Node(bronchitisdist,   name="bronchitis")
xray                 = Node(xraydist,         name="xray")
dyspnea              = Node(dyspneadist,      name="dyspnea")
tuberculosisOrCancer = Node(tOrCdist,         name="tuberculosisOrCancer")

Define your model, adding states and edges

In [6]:
model = BayesianNetwork("Breathing Diagnosis")
model.add_states(asia, smoker, lungCancer, tuberculosis, bronchitis, xray, dyspnea, tuberculosisOrCancer)
model.add_edge(asia, tuberculosis)
model.add_edge(smoker, lungCancer)
model.add_edge(smoker, bronchitis)
model.add_edge(tuberculosis, tuberculosisOrCancer)
model.add_edge(lungCancer, tuberculosisOrCancer)
model.add_edge(tuberculosisOrCancer, xray)
model.add_edge(tuberculosisOrCancer, dyspnea)
model.add_edge(bronchitis, dyspnea)
model.bake()

------------------------------------------------

#### Questions

1.  Before checking, write down your guess for the probability that an individual in the population at large would have a positive X-Ray (i.e. a result that suggests either lung cancer or tuberculosis)

I think the probability of a positive X-Ray should be less than 1%, since few people should have lung cancer or tuberculosis.

2.  Now read that probability from the model you built.   

In [7]:
model.predict_proba({})[5].parameters[0].get('XRay')

0.11053500066501697
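It can help to see how a marginal like this is assembled from the tables feeding the X-Ray node. The sketch below is a hand calculation under stated assumptions, not a readout of the model: all four input probabilities are illustrative values chosen for the example, TuberculosisOrCancer is assumed to be an exact logical OR, and Tuberculosis and LungCancer are treated as independent when nothing has been observed.

```python
# Illustrative values only (assumptions, not read from asia.csv or the model).
p_tuberculosis = 0.0104        # assumed marginal P(Tuberculosis)
p_lung_cancer = 0.055          # assumed marginal P(LungCancer)
p_xray_given_either = 0.98     # assumed P(XRay positive | TuberculosisOrCancer)
p_xray_given_neither = 0.05    # assumed P(XRay positive | neither disease)

# P(TuberculosisOrCancer), treating the node as a logical OR of two
# independent causes.
p_either = 1.0 - (1.0 - p_tuberculosis) * (1.0 - p_lung_cancer)

# Law of total probability over the X-Ray node's single parent.
true_positive_mass = p_xray_given_either * p_either
false_positive_mass = p_xray_given_neither * (1.0 - p_either)
p_xray = true_positive_mass + false_positive_mass

print(f"P(TuberculosisOrCancer) = {p_either:.4f}")
print(f"positive X-Ray with disease    = {true_positive_mass:.4f}")
print(f"positive X-Ray without disease = {false_positive_mass:.4f}")
print(f"P(XRay positive)               = {p_xray:.4f}")
```

Under these assumed numbers, a positive X-Ray in someone with neither disease contributes a noticeable part of the total, so the marginal can sit well above the disease rate itself.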

3.  Is your empirical result significantly different from your guess?  If so, explain why.  If not, explain how you came to your original guess.

Yes, the two values are very different. According to this data, around 11% of the population has a positive X-Ray. One possible reason is that the data set is not representative of the current population; it is most likely biased.

4.  How much does a trip to Asia affect the likelihood of an individual having Dyspnea?

In [8]:
model.predict_proba({"asia":"VisitAsia"})[6].parameters[0].get('Dyspnea')

0.4538813486201607

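Given a trip to Asia, the probability that an individual has Dyspnea is about 0.454. To say how much the trip changes that probability, this value has to be compared with the unconditional probability of Dyspnea, which can be read from `model.predict_proba({})[6]`.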
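Measuring the effect of evidence means comparing a conditional probability with its baseline. The sketch below shows that comparison with inference by enumeration in plain Python. It is independent of pomegranate, and every probability in it is an illustrative assumption rather than a value estimated from `asia.csv`; the helpers `bern`, `joint` and `query` are hypothetical names, and TuberculosisOrCancer is assumed to be an exact logical OR.

```python
from itertools import product

# Illustrative CPT values (assumptions, NOT estimated from asia.csv).
P_ASIA = 0.01
P_SMOKER = 0.5
P_TB = {True: 0.05, False: 0.01}        # P(Tuberculosis | VisitAsia)
P_CANCER = {True: 0.10, False: 0.01}    # P(LungCancer | Smoker)
P_BRONCH = {True: 0.60, False: 0.30}    # P(Bronchitis | Smoker)
P_XRAY = {True: 0.98, False: 0.05}      # P(XRay | TuberculosisOrCancer)
P_DYSPNEA = {                           # P(Dyspnea | TuberculosisOrCancer, Bronchitis)
    (True, True): 0.9, (True, False): 0.7,
    (False, True): 0.8, (False, False): 0.1,
}

NAMES = ("asia", "smoker", "tuberculosis", "lungCancer",
         "bronchitis", "tuberculosisOrCancer", "xray", "dyspnea")


def bern(p, value):
    """Probability of a binary outcome when P(True) = p."""
    return p if value else 1.0 - p


def joint(a, s, t, c, b, e, x, d):
    """Joint probability of one full assignment, following the network arcs."""
    if e != (t or c):  # deterministic OR node (an assumption of this sketch)
        return 0.0
    return (bern(P_ASIA, a) * bern(P_SMOKER, s) * bern(P_TB[a], t)
            * bern(P_CANCER[s], c) * bern(P_BRONCH[s], b)
            * bern(P_XRAY[e], x) * bern(P_DYSPNEA[(e, b)], d))


def query(target, evidence=None):
    """P(target is True | evidence), summing the joint over all assignments."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


baseline = query("dyspnea")
after_trip = query("dyspnea", {"asia": True})
print(f"P(dyspnea)        = {baseline:.4f}")
print(f"P(dyspnea | asia) = {after_trip:.4f}")
print(f"absolute change   = {after_trip - baseline:+.4f}")
```

Enumeration visits all 256 assignments, which is fine for eight binary nodes but does not scale to large networks. The same comparison in the notebook would use two `predict_proba` calls, one with the evidence and one without.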

5.  Suppose you are a nonsmoker individual presenting with Dyspnea and you have never been to Asia.  In your panic you have a chest XRay done, which comes out negative.   What do you now know about the state of your health?

In [9]:
model.predict_proba({"smoker":"NotSmoker","dyspnea":"Dyspnea","asia":"NotVisitAsia", "xray":"NotXRay"})

array(['NotVisitAsia', 'NotSmoker',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "NotCancer" :0.9993319686248006,
            "Cancer" :0.0006680313751993797
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Tuberculosis" :0.0006638004335899343,
            "NotTuberculosis" :0.9993361995664101
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "NotBronchitis" :0.21997111134443062,
            "Bronchitis" :0.7800288886555693
        }
    ],
    "frozen" :false
},
       'NotXRay', 'Dyspnea',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "TuberculosisOrCancer"

This individual has a very low probability of having Cancer (about 0.00067) or Tuberculosis (about 0.00066), but a high probability of having Bronchitis (about 0.78).
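The drop in the disease probabilities after a negative X-Ray follows from Bayes' rule. The sketch below is a simplified hand calculation, not the model's output: it conditions on the X-Ray result only (ignoring the smoking, travel and Dyspnea evidence used in the query), and the prior and the two X-Ray probabilities are illustrative assumptions.

```python
# Illustrative values only (assumptions, not read from asia.csv or the model).
prior_either = 0.064828          # assumed P(TuberculosisOrCancer) before the X-Ray
p_pos_given_either = 0.98        # assumed P(XRay positive | disease)
p_pos_given_neither = 0.05       # assumed P(XRay positive | no disease)

# Probability of a NEGATIVE X-Ray under each hypothesis.
p_neg_given_either = 1.0 - p_pos_given_either
p_neg_given_neither = 1.0 - p_pos_given_neither

# Bayes' rule: P(disease | negative) = P(negative | disease) * P(disease) / P(negative)
p_negative = (p_neg_given_either * prior_either
              + p_neg_given_neither * (1.0 - prior_either))
posterior_either = p_neg_given_either * prior_either / p_negative

print(f"prior     P(TuberculosisOrCancer)            = {prior_either:.5f}")
print(f"posterior P(TuberculosisOrCancer | negative) = {posterior_either:.5f}")
```

Because a negative result is rare when either disease is present, it shrinks the disease probability sharply, which leaves Bronchitis as the most plausible explanation for the Dyspnea.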