# Bayesian Networks

A Bayesian Network, also known as a Bayes Network, Belief Network, or Decision Network, is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). It's a powerful tool for dealing with uncertainty in complex systems, making it useful in various fields such as machine learning, artificial intelligence, bioinformatics, and epidemiology.

Here’s a breakdown of the key components and concepts in a Bayesian Network:

## 1. Nodes
Each node in the network represents a variable, which can be an observable quantity, a latent variable, or an unknown parameter. These variables can be discrete or continuous.

## 2. Edges
The directed edges (arrows) between nodes represent conditional dependencies. An edge from node A to node B means that B depends directly on A. Such edges are often read causally (A influences B), although the graph itself only encodes conditional dependence, not necessarily causation.

## 3. Conditional Probability Tables (CPTs)
For discrete variables, each node has an associated conditional probability table (CPT) that quantifies how its parent nodes affect it. The CPT gives the probability of each possible value of the node for every possible combination of values of its parents.
For nodes without parents (root nodes), the CPT reduces to the prior probability of the node.

### 3.1. Equation for Conditional Probability

The general form of a conditional probability is given by:

$ P(X|Y) = \frac{P(X \cap Y)}{P(Y)} $

where:
- $P(X|Y)$ is the probability of $X$ given $Y$,
- $P(X \cap Y)$ is the joint probability of $X$ and $Y$ occurring together,
- $P(Y)$ is the probability of $Y$, which must be greater than zero for the conditional probability to be defined.

In the context of Bayesian Networks and CPTs, we often deal with more specific forms, especially when we have multiple parent nodes. For a node $X$ with parents $Y$ and $Z$, the conditional probability can be specified as:

$ P(X|Y, Z) $

This notation does not change the definition of conditional probability; it simply conditions $X$ on several variables at once.

### 3.2. Example of a CPT

Let's consider a simple example with two parent nodes $A$ and $B$, and a child node $C$. Suppose each node is binary (can either be true or false). The CPT for $C$ would specify the probability of $C$ being true or false given every combination of $A$ and $B$.

Let's say we have the following probabilities:

- $P(C = true | A = true, B = true) = 0.9$
- $P(C = true | A = true, B = false) = 0.5$
- $P(C = true | A = false, B = true) = 0.4$
- $P(C = true | A = false, B = false) = 0.1$

This CPT quantifies how the states of $A$ and $B$ influence $C$. For example, $C$ is most likely to be true when both $A$ and $B$ are true, and least likely when both are false.

For completeness, the probabilities of $C = false$ given each combination of $A$ and $B$ would simply be $1$ minus the probabilities of $C = true$, because $C$ must either be true or false.

### 3.3. How to Use a CPT

CPTs are used in Bayesian Network inference to compute the probabilities of certain events given evidence. For example, if you know that $A = true$ and $B = false$, you can look up in the CPT to find that $P(C = true | A = true, B = false) = 0.5$. This information can be used directly in inference calculations or to update beliefs in light of new evidence.

CPTs represent a structured way to encode and utilize conditional probabilities in Bayesian Networks, enabling complex reasoning about interdependencies between variables.
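As a minimal sketch of the lookup described above, the CPT for $C$ can be stored as a dictionary keyed by the parent states. The names `cpt_c` and `prob_c` are illustrative choices, not part of any library; the probabilities are the ones from the example.

```python
# CPT for C given its parents (A, B), using the probabilities from the example.
# Keys are (a, b) parent states; values are P(C = true | A = a, B = b).
cpt_c = {
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.1,
}


def prob_c(c, a, b):
    """Return P(C = c | A = a, B = b) by table lookup."""
    p_true = cpt_c[(a, b)]
    # C is binary, so P(C = false | a, b) = 1 - P(C = true | a, b).
    return p_true if c else 1.0 - p_true


print(prob_c(True, True, False))    # P(C = true  | A = true,  B = false)
print(prob_c(False, False, False))  # P(C = false | A = false, B = false)
```

Storing only the `C = true` column is enough for a binary node, because the `C = false` column follows from the complement rule.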


## 4. Joint Probability Distribution
A Bayesian Network represents the joint probability distribution of all the variables in the system, that is, a distribution that assigns a probability to every possible combination of values of those variables. Using the chain rule of probability, this joint distribution can be decomposed into a product of conditional distributions, and the network structure determines which variables each factor is conditioned on.

### 4.1. Equation for the Joint Probability Distribution in Bayesian Networks

Given a set of variables $X_1, X_2, ..., X_n$, the joint probability distribution can be decomposed as follows:

$ P(X_1, X_2, ..., X_n) = P(X_1) \cdot P(X_2|X_1) \cdot P(X_3|X_1, X_2) \cdot ... \cdot P(X_n|X_1, X_2, ..., X_{n-1}) $

In a Bayesian Network, this decomposition aligns with the network's structure, simplifying to:

$ P(X_1, X_2, ..., X_n) = \prod_{i=1}^{n} P(X_i | Parents(X_i)) $

where $Parents(X_i)$ are the parent nodes of $X_i$ in the network. If $X_i$ has no parents, then $P(X_i | Parents(X_i))$ simply becomes $P(X_i)$, the prior probability of $X_i$.

### 4.2. Example

Consider a simple Bayesian Network with three variables: Rain ($R$), Sprinkler ($S$), and Wet Grass ($W$). Suppose $R$ affects both $S$ and $W$, and $S$ also affects $W$. The structure implies the following relationships:

- $R$ has no parents.
- $S$ depends on $R$.
- $W$ depends on both $R$ and $S$.

The joint probability distribution for this network can be expressed as:

$ P(R, S, W) = P(R) \cdot P(S|R) \cdot P(W|R, S) $

Let's fill in some details:

- $P(R=true) = 0.3$
- $P(S=true|R=true) = 0.5$, $P(S=true|R=false) = 0.1$
- $P(W=true|R=true, S=true) = 0.9$, $P(W=true|R=true, S=false) = 0.8$, $P(W=true|R=false, S=true) = 0.9$, $P(W=true|R=false, S=false) = 0$

To calculate a specific joint probability, for example, the probability that it rained, the sprinkler was on, and the grass is wet, we would use:

$ P(R=true, S=true, W=true) = P(R=true) \cdot P(S=true|R=true) \cdot P(W=true|R=true, S=true) $

Substituting the values above gives $0.3 \times 0.5 \times 0.9 = 0.135$, so the probability that it rained ($R=true$), the sprinkler was on ($S=true$), and the grass is wet ($W=true$) is $0.135$ or $13.5\%$.

This example demonstrates how the joint probability distribution of a Bayesian Network can be decomposed into a product of conditional distributions according to the network's structure, allowing for the calculation of the probability of specific combinations of variable states.

```python
# Given probabilities
P_R_true = 0.3
P_S_true_given_R_true = 0.5
P_W_true_given_R_true_S_true = 0.9

# Calculating specific joint probability
P_joint = P_R_true * P_S_true_given_R_true * P_W_true_given_R_true_S_true

print(P_joint)
```

Output:

```
0.135
```
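The same factorization can be used to build the whole joint table for the Rain/Sprinkler/Wet Grass example. The sketch below is one possible way to do it, using the probabilities listed in the example; the helper names `bernoulli` and `joint` are illustrative.

```python
from itertools import product

# CPTs from the Rain / Sprinkler / Wet Grass example.
p_r = 0.3                                  # P(R = true)
p_s_given_r = {True: 0.5, False: 0.1}      # P(S = true | R)
p_w_given_rs = {                           # P(W = true | R, S)
    (True, True): 0.9,
    (True, False): 0.8,
    (False, True): 0.9,
    (False, False): 0.0,
}


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of 'true'."""
    return p_true if value else 1.0 - p_true


def joint(r, s, w):
    """P(R, S, W) = P(R) * P(S | R) * P(W | R, S)."""
    return (
        bernoulli(p_r, r)
        * bernoulli(p_s_given_r[r], s)
        * bernoulli(p_w_given_rs[(r, s)], w)
    )


# Enumerate all 2^3 combinations of (R, S, W).
table = {state: joint(*state) for state in product([True, False], repeat=3)}

total = sum(table.values())                                # should be 1
p_w_true = sum(p for (r, s, w), p in table.items() if w)   # marginal P(W = true)
print(total, p_w_true)
```

The entries sum to 1 (up to floating-point rounding), and any marginal, such as $P(W = true)$, is obtained by adding up the matching rows.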

The essence of a Bayesian Network lies in its ability to model the joint probability distribution over a set of variables in a compact and efficient manner by exploiting conditional independencies among variables. This allows for:

1. **Inference**: Given some observed variables, we can compute the posterior distribution of other variables. For example, in a medical diagnosis network, observing symptoms can help compute the probabilities of various diseases.

2. **Learning**: If the structure of the network is known but the parameters (probabilities in CPTs) are unknown, they can be learned from data. Alternatively, learning can also involve discovering the structure of the network from data.

3. **Reasoning under uncertainty**: By incorporating evidence and updating beliefs, Bayesian Networks enable reasoning in situations where information is incomplete or uncertain.

4. **Decision making and prediction**: Bayesian Networks can be used to make predictions about future events or to make decisions that maximize expected utility based on the probabilistic dependencies between variables.



## 1. Inference in Bayesian Networks
Inference in Bayesian Networks involves computing the posterior distribution of a set of target variables given evidence. One of the most fundamental equations for Bayesian inference is derived from Bayes' theorem. For a simple case involving a target variable $X$ and evidence $E$, Bayes' theorem can be stated as:

$ P(X|E) = \frac{P(E|X)P(X)}{P(E)} $

where:
- $P(X|E)$ is the posterior probability of $X$ given evidence $E$,
- $P(E|X)$ is the likelihood of observing $E$ given $X$,
- $P(X)$ is the prior probability of $X$, and
- $P(E)$ is the probability of observing the evidence $E$, which acts as a normalization constant ensuring that the posterior probabilities sum to 1.

In the context of Bayesian Networks, inference often involves more complex scenarios with multiple variables. A common task is to compute the posterior distribution of some variables given observations for others. This can involve summing out the unobserved variables from the joint probability distribution.

### Numerical Example: Inference

Let's consider a simple example with two binary variables: Rain ($R$) and Wet Grass ($W$). Suppose we know the following probabilities:

- $P(R = true) = 0.2$ (prior probability of it raining)
- $P(W = true | R = true) = 0.9$ (probability that the grass is wet given that it rained)
- $P(W = true | R = false) = 0.1$ (probability that the grass is wet without rain)

Given that the grass is wet ($W = true$), we want to find the probability that it has rained ($P(R = true | W = true)$).

Using Bayes' theorem:

$ P(R = true | W = true) = \frac{P(W = true | R = true)P(R = true)}{P(W = true)} $

To find $P(W = true)$, we consider all ways $W$ can be true:

$ P(W = true) = P(W = true | R = true)P(R = true) + P(W = true | R = false)P(R = false) $

Substituting the known values, with $P(R = false) = 1 - 0.2 = 0.8$:

$ P(W = true) = 0.9 \times 0.2 + 0.1 \times 0.8 = 0.18 + 0.08 = 0.26 $

Thus,

$ P(R = true | W = true) = \frac{0.9 \times 0.2}{0.26} $

Evaluating this expression, the probability that it has rained, given that the grass is wet, is approximately $0.692$ or $69.2\%$.
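The calculation above can be written as a small helper for a binary hypothesis. This is a sketch only; the function name `posterior_binary` and its parameter names are illustrative.

```python
def posterior_binary(prior, lik_if_true, lik_if_false):
    """Bayes' theorem for a binary hypothesis.

    prior        : P(H = true)
    lik_if_true  : P(E | H = true)
    lik_if_false : P(E | H = false)
    Returns P(H = true | E).
    """
    # Law of total probability gives the normalization constant P(E).
    evidence = lik_if_true * prior + lik_if_false * (1.0 - prior)
    return lik_if_true * prior / evidence


# Rain / Wet Grass numbers from the example above.
p_rain_given_wet = posterior_binary(prior=0.2, lik_if_true=0.9, lik_if_false=0.1)
print(round(p_rain_given_wet, 3))
```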



Approaches to Bayesian inference fall into two broad groups, approximate and exact methods, each with its own sub-categories:

### 1.1. Approximate Bayesian Inference

Approximate inference methods are used when exact computation of the posterior distribution is intractable due to the complexity of the model or the size of the data.

#### 1.1-1) Randomized Approaches
Methods that use random sampling to approximate the posterior distribution.

- **1.1-1-1) Importance Sampling:** A technique where samples are drawn from an easy-to-sample distribution and then weighted to approximate the posterior distribution.

- **1.1-1-2) Markov Chain Monte Carlo (MCMC):** Constructs a Markov chain whose stationary distribution is the posterior and uses the states it visits as samples.
  - **1.1-1-2-1) Gibbs Sampling:** A special case of MCMC that cycles through the variables, resampling each one from its conditional distribution given the current values of all the others.
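To make the Gibbs sampling idea concrete, here is a minimal sketch that estimates $P(R = true \mid W = true)$ in the Rain/Sprinkler/Wet Grass network from the joint distribution example earlier. The function names, the number of samples, the burn-in length, and the seed are all illustrative assumptions; the CPT values are the ones given in that example.

```python
import random

# CPTs from the Rain / Sprinkler / Wet Grass example.
P_R = 0.3
P_S_GIVEN_R = {True: 0.5, False: 0.1}
P_W_GIVEN_RS = {
    (True, True): 0.9,
    (True, False): 0.8,
    (False, True): 0.9,
    (False, False): 0.0,
}


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def unnormalized(r, s):
    """P(R = r, S = s, W = true), proportional to P(R = r, S = s | W = true)."""
    return bern(P_R, r) * bern(P_S_GIVEN_R[r], s) * P_W_GIVEN_RS[(r, s)]


def gibbs_rain_given_wet(n_samples=50_000, burn_in=1_000, seed=0):
    """Estimate P(R = true | W = true) by Gibbs sampling over (R, S)."""
    rng = random.Random(seed)
    r, s = True, True  # start in a state with nonzero probability
    rain_count = 0
    for step in range(burn_in + n_samples):
        # Resample R from P(R | S = s, W = true).
        w_true, w_false = unnormalized(True, s), unnormalized(False, s)
        r = rng.random() < w_true / (w_true + w_false)
        # Resample S from P(S | R = r, W = true).
        w_true, w_false = unnormalized(r, True), unnormalized(r, False)
        s = rng.random() < w_true / (w_true + w_false)
        if step >= burn_in:
            rain_count += r
    return rain_count / n_samples


exact = 0.255 / 0.318  # from summing the joint table directly
print(gibbs_rain_given_wet(), exact)
```

For a network this small, exact enumeration is of course cheaper; the sampler is only worthwhile when the joint table is too large to enumerate.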

#### 1.1-2) Deterministic Approaches
These methods deterministically approximate the posterior without relying on stochastic sampling.

- **1.1-2-1) Variational Inference:** Approximates the posterior by finding, within a simpler family of distributions, the member $q$ that minimizes the KL divergence $KL(q \| p)$ to the true posterior $p$.
  - **1.1-2-1-1) Mean Field Variational Inference:** Restricts the approximating distribution to one that factorizes over the latent variables (treating them as independent) and optimizes the parameters of each factor accordingly.

- **1.1-2-2) Loopy Belief Propagation:** An iterative message-passing algorithm for approximating marginal distributions in graphical models, particularly those with loops.

- **1.1-2-3) LP Relaxations:** Linear programming relaxations of the discrete optimization problem of finding the most probable assignment (MAP inference).
  - **1.1-2-3-1) Dual Decomposition:** A method that decomposes an optimization problem into smaller subproblems that can be solved independently and efficiently.

- **1.1-2-4) Beam Search:** A heuristic search algorithm that expands only a fixed number of the most promising candidates at each step.

- **1.1-2-5) Local Search:** An optimization technique that starts with an initial solution and iteratively moves to a neighboring solution with a higher likelihood.

### 1.2. Exact Bayesian Inference

Exact inference methods compute the posterior distribution without approximations but are often limited to simpler or more structured models due to computational constraints.

- **1.2-1) Integer Linear Programming (ILP):** Formulates the search for the most probable assignment (MAP inference) as an optimization problem with a linear objective, linear constraints, and integer decision variables, which can then be solved exactly.

- **1.2-2) Variable Elimination:** A systematic method for computing marginals in graphical models by sequentially summing out variables.
  - **1.2-2-1) Dynamic Programming:** Variable elimination is a form of dynamic programming: intermediate sums are computed once and reused, which makes computing marginals in Bayesian networks efficient.

These categories and subcategories provide a structured overview of the methods available for Bayesian inference and highlight the trade-off between accuracy and computational cost.
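As an illustration of variable elimination, the sketch below computes a marginal in a chain-shaped network $A \to B \to C \to D$ by summing out one variable at a time, and checks the result against brute-force enumeration of the joint. The network and all of its numbers are hypothetical and chosen only for this example.

```python
from itertools import product

# Hypothetical chain A -> B -> C -> D of binary variables (illustrative numbers).
p_a = [0.6, 0.4]                          # p_a[a] = P(A = a)
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]    # p_b_given_a[a][b] = P(B = b | A = a)
p_c_given_b = [[0.9, 0.1], [0.4, 0.6]]    # p_c_given_b[b][c] = P(C = c | B = b)
p_d_given_c = [[0.5, 0.5], [0.1, 0.9]]    # p_d_given_c[c][d] = P(D = d | C = c)


def eliminate(message, cpt):
    """Sum out the parent: new[j] = sum_i message[i] * cpt[i][j]."""
    return [sum(message[i] * cpt[i][j] for i in range(2)) for j in range(2)]


def marginal_d_by_elimination():
    """Eliminate A, then B, then C; each step reuses the previous result."""
    p_b = eliminate(p_a, p_b_given_a)
    p_c = eliminate(p_b, p_c_given_b)
    return eliminate(p_c, p_d_given_c)


def marginal_d_by_enumeration():
    """Reference answer: sum the full joint over all 2^4 assignments."""
    p_d = [0.0, 0.0]
    for a, b, c, d in product(range(2), repeat=4):
        p_d[d] += p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_c[c][d]
    return p_d


print(marginal_d_by_elimination())
print(marginal_d_by_enumeration())
```

On a chain of $n$ binary variables, elimination needs a number of multiplications that grows linearly in $n$, whereas enumeration visits $2^n$ assignments.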


### 1.3 Problem Scenario: Estimating the Bias of a Coin

Suppose we have a coin that we suspect is not fair, and we want to estimate the bias of the coin, specifically, the probability $p$ of landing heads (H) when flipped. We perform an experiment where we flip the coin $N$ times and observe the number of heads $k$.

We wish to infer the probability $p$ based on our observations, using a Bayesian approach. We can model our prior belief about $p$ using a Beta distribution, a common choice for probabilities, with parameters $\alpha$ and $\beta$. This gives us a flexible way to encode our prior knowledge or lack thereof about the coin's fairness.

For this example, let's say we have no strong prior belief about the coin's bias, so we choose $\alpha = 1$ and $\beta = 1$, which corresponds to a uniform distribution over $[0, 1]$. Then, we flip the coin $N = 10$ times, observing $k = 7$ heads.

### 1.4 Bayesian Model

- **Likelihood**: The likelihood of observing $k$ heads in $N$ flips given $p$ is modeled by the Binomial distribution: $P(k | p, N) = \binom{N}{k} p^k (1-p)^{N-k}$.
- **Prior**: The prior distribution of $p$ is $Beta(\alpha, \beta)$. Given our choice, it's $Beta(1, 1)$, a uniform distribution.
- **Posterior**: Using Bayes' theorem, the posterior distribution of $p$ after observing $k$ heads in $N$ flips is $Beta(\alpha + k, \beta + N - k)$.
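The conjugate update in the last bullet amounts to adding counts to the prior parameters. A minimal sketch, with illustrative function names, using the standard formulas for the mean and mode of a Beta distribution:

```python
def beta_binomial_update(alpha, beta, heads, flips):
    """Posterior Beta parameters after observing `heads` successes in `flips` trials."""
    return alpha + heads, beta + (flips - heads)


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    # Valid for alpha > 1 and beta > 1.
    return (alpha - 1) / (alpha + beta - 2)


# Uniform prior Beta(1, 1), then 7 heads in 10 flips.
post_alpha, post_beta = beta_binomial_update(alpha=1, beta=1, heads=7, flips=10)
print(post_alpha, post_beta)
print(beta_mean(post_alpha, post_beta), beta_mode(post_alpha, post_beta))
```

With a uniform prior, the posterior mode equals the observed frequency of heads, while the posterior mean is pulled slightly toward $0.5$.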

### 1.5 Solving the Problem

1. **Importance Sampling (Randomized Approach)**
   - We could generate samples of $p$ from a proposal distribution (e.g., another Beta distribution with different parameters) and then weight these samples according to how likely they are given our data and our actual prior. This would give us an approximation of the posterior distribution.

2. **MCMC - Gibbs Sampling (Randomized Approach)**
   - Our model is simple enough that we can sample directly from the posterior. In more complex models with several parameters, we could use Gibbs sampling if the conditional distributions are known and tractable. This would involve iteratively sampling each parameter from its distribution conditional on the other parameters.

3. **Variational Inference - Mean Field (Deterministic Approach)**
   - We could approximate the posterior distribution with a simpler distribution, such as another Beta distribution. By minimizing the KL divergence between this approximating distribution and the true posterior, we can find the parameters of this simpler distribution that best approximate our true posterior.

4. **Loopy Belief Propagation (Deterministic Approach)**
   - While not directly applicable in this simple problem, in a more complex graphical model, loopy belief propagation could be used to approximate marginal distributions, including the marginal distribution of $p$.

5. **Exact Methods**
   - For our problem, exact inference is straightforward since the Beta-Binomial conjugacy allows for a direct calculation of the posterior distribution, which is $Beta(\alpha + k, \beta + N - k)$. After observing $k = 7$ heads out of $N = 10$ flips, our posterior for $p$ is $Beta(1 + 7, 1 + 10 - 7) = Beta(8, 4)$.
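The importance sampling approach from item 1 can be sketched as follows. For simplicity, this version assumes a Uniform(0, 1) proposal, which coincides with the $Beta(1, 1)$ prior, rather than a Beta proposal with different parameters; the function name, sample count, and seed are illustrative.

```python
import random


def importance_sampling_posterior_mean(heads=7, flips=10, n_samples=100_000, seed=0):
    """Self-normalized importance sampling estimate of E[p | data]."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n_samples):
        p = rng.random()  # proposal: Uniform(0, 1), equal to the Beta(1, 1) prior
        # weight = prior * likelihood / proposal. Prior and proposal cancel here,
        # and the binomial coefficient cancels after normalization.
        weight = p ** heads * (1.0 - p) ** (flips - heads)
        weighted_sum += weight * p
        weight_total += weight
    return weighted_sum / weight_total


estimate = importance_sampling_posterior_mean()
exact = 8 / (8 + 4)  # mean of the exact Beta(8, 4) posterior
print(estimate, exact)
```

Because the exact posterior is available here, the estimate can be checked directly against the $Beta(8, 4)$ mean; in models without a closed form, this check is not available and the quality of the proposal matters much more.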

### 1.6 Conclusion

In this specific scenario, the exact method provides a straightforward solution, yielding a Beta(8, 4) posterior distribution for $p$. However, the various approximate methods offer powerful alternatives for more complex problems where exact inference is not feasible. Each approach has its own strengths and trade-offs: deterministic approaches are generally faster and more scalable but introduce an approximation error that does not vanish with more computation, while randomized approaches can be made increasingly accurate by drawing more samples, at the cost of increased computation.

## 2. Learning in Bayesian Networks

Learning in Bayesian Networks involves either estimating the parameters of the network (the probabilities in the Conditional Probability Tables, CPTs) when the structure is known, or learning the structure of the network itself from data. The learning process is generally about updating our beliefs based on observed data, which in the context of Bayesian Networks, can be described using Bayes' theorem for parameter learning.

### Equation for Learning: Parameter Learning

When the network structure is known, but we need to estimate the parameters (probabilities in CPTs), we use the data to update our beliefs about these parameters. The Bayesian approach to parameter learning can be summarized by the posterior distribution of the parameters $\theta$ given data $D$:

$ P(\theta | D) = \frac{P(D | \theta)P(\theta)}{P(D)} $

where:
- $P(\theta | D)$ is the posterior probability of the parameters given the data,
- $P(D | \theta)$ is the likelihood of the data given the parameters,
- $P(\theta)$ is the prior probability of the parameters, and
- $P(D)$ is the probability of the data, serving as a normalization constant.

### Numerical Example: Parameter Learning

Let's say we have a simple Bayesian Network with a single binary variable $A$, and we want to estimate the probability $P(A=true)$ based on observed data. Suppose we have observed $A=true$ in 7 out of 10 instances.

If we assume a prior distribution for $P(A=true)$ that is uniform (i.e., all probabilities are equally likely, which is equivalent to a Beta distribution $Beta(1, 1)$), then our prior belief about $P(A=true)$ is that it has a mean of $0.5$.

The likelihood of observing our data given $P(A=true)=p$ follows a binomial distribution, and after observing 7 "true" out of 10 instances, the posterior distribution of $P(A=true)$ can be updated using the Beta distribution properties:

The posterior distribution is $Beta(\alpha + \text{successes}, \beta + \text{failures})$, where $\alpha=1$ and $\beta=1$ are the parameters of the prior distribution.

The updated belief about $P(A=true)$ is computed below, after an overview of the main approaches to learning.

### Approaches to Learning

1. **Parameter Learning**:
    - **Maximum Likelihood Estimation (MLE)**: Estimates parameters by maximizing the likelihood of the observed data.
    - **Bayesian Estimation**: Updates the belief about parameters in a way that incorporates prior knowledge, using observed data.

2. **Structure Learning**:
    - **Constraint-based methods**: Infer the structure by testing conditional independence among variables.
    - **Score-based methods**: Assign a score to different structures based on how well they fit the data and search for the structure with the best score.
    - **Hybrid methods**: Combine constraint-based and score-based approaches to leverage the strengths of both.

Returning to the parameter learning example above:

After observing $7$ instances of $A=true$ out of $10$, our updated belief about $P(A=true)$ is represented by a posterior Beta distribution, $Beta(1 + 7, 1 + 3) = Beta(8, 4)$. The mean of this posterior distribution is $\frac{8}{8 + 4} \approx 0.667$, indicating that, given the observed data and our prior, our best estimate for $P(A=true)$ is $66.7\%$.

This numerical example demonstrates parameter learning in a Bayesian Network for a simple case. The Bayesian approach allows us to update our beliefs about the parameters in a principled manner, incorporating both prior knowledge and observed data.
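The same idea extends to the entries of a CPT: each row is estimated from the records that match its parent configuration. The sketch below contrasts maximum likelihood estimation with Bayesian estimation under a $Beta(1, 1)$ prior for a two-node network $A \to C$. The data set and the function name `estimate_cpt` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical observations of (A, C) pairs for a two-node network A -> C.
data = (
    [(True, True)] * 6
    + [(True, False)] * 2
    + [(False, True)] * 1
    + [(False, False)] * 3
)


def estimate_cpt(records, alpha=0.0, beta=0.0):
    """Estimate P(C = true | A = a) for each parent state a.

    alpha = beta = 0 gives the maximum likelihood estimate (relative frequency).
    alpha = beta = 1 gives the posterior mean under a Beta(1, 1) prior.
    Note: the MLE is undefined for a parent state that never occurs in the data.
    """
    counts = Counter(records)
    cpt = {}
    for a in (True, False):
        n_true = counts[(a, True)]
        n_false = counts[(a, False)]
        cpt[a] = (n_true + alpha) / (n_true + n_false + alpha + beta)
    return cpt


mle = estimate_cpt(data)
bayes = estimate_cpt(data, alpha=1.0, beta=1.0)
print(mle)
print(bayes)
```

The Bayesian estimates are pulled toward $0.5$, and the pull is stronger for the parent state with fewer observations.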

## 3. Reasoning under uncertainty
Reasoning under uncertainty in the context of Bayesian Networks involves using the network to make inferences about unknown variables given some evidence. This is fundamentally about calculating posterior probabilities of certain hypotheses given observed evidence, and it's deeply tied to the concepts of inference discussed earlier.

### Equation for Reasoning under Uncertainty: Total Probability and Bayes' Theorem

A core principle in reasoning under uncertainty is the law of total probability, combined with Bayes' theorem. For a hypothesis $H$ and evidence $E$, Bayes' theorem provides a way to update our belief about $H$ given $E$:

$ P(H|E) = \frac{P(E|H)P(H)}{P(E)} $

where:
- $P(H|E)$ is the posterior probability of $H$ given $E$,
- $P(E|H)$ is the likelihood of observing $E$ given $H$,
- $P(H)$ is the prior probability of $H$,
- $P(E)$ is the probability of observing $E$, which can be calculated using the law of total probability if $E$ can be influenced by multiple hypotheses.

### Numerical Example: Reasoning under Uncertainty

Imagine a scenario where you are trying to diagnose whether a patient has a disease based on a test result. Let's denote:
- $D$ as the event that the patient has the disease,
- $\neg D$ as the event that the patient does not have the disease,
- $T+$ as the event that the test result is positive,
- $P(D) = 0.01$ (prior probability that a randomly selected patient has the disease),
- $P(T+|D) = 0.95$ (probability of a positive test result if the patient has the disease),
- $P(T+|\neg D) = 0.05$ (probability of a positive test result if the patient does not have the disease).

Given a positive test result ($T+$), we want to calculate the probability that the patient actually has the disease ($P(D|T+)$).

Using Bayes' theorem:

$ P(D|T+) = \frac{P(T+|D)P(D)}{P(T+)} $

Where $P(T+)$ can be calculated using the law of total probability:

$ P(T+) = P(T+|D)P(D) + P(T+|\neg D)P(\neg D) $

The value of $P(D|T+)$ is computed below, after an overview of the main approaches.

### Approaches to Reasoning under Uncertainty

- **Exact Inference**: Using methods like variable elimination or the junction tree algorithm to compute exact posterior probabilities.
- **Approximate Inference**: Employing sampling techniques or variational methods to estimate posterior probabilities when exact calculation is infeasible.
- **Qualitative Reasoning**: Leveraging qualitative judgments about the relationships between variables (e.g., causal reasoning) when quantitative data is sparse.

Returning to the diagnostic test example above, $P(T+) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.059$.

Given a positive test result ($T+$), the probability that the patient actually has the disease is therefore $P(D|T+) = 0.0095 / 0.059 \approx 0.161$ or $16.1\%$.

This numerical example illustrates reasoning under uncertainty, demonstrating how a seemingly high probability of a positive test result given the disease ($95\%$) translates into a much lower posterior probability of having the disease given a positive test result, due to the low prior probability of the disease ($1\%$) and the possibility of false positives. This is a classic example of how Bayesian reasoning helps in making informed decisions under uncertainty.
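To see how strongly the prior drives this result, the sketch below recomputes the posterior for a few different priors while keeping the test characteristics from the example fixed. Only the prior of $0.01$ comes from the example; the other two priors and the function name are hypothetical values chosen for comparison.

```python
def posterior_disease(prior, sensitivity=0.95, false_positive_rate=0.05):
    """P(D | T+) from Bayes' theorem and the law of total probability."""
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# 0.01 is the prior from the example; 0.1 and 0.5 are hypothetical alternatives.
for prior in (0.01, 0.1, 0.5):
    print(f"prior={prior:.2f} -> P(D | T+) = {posterior_disease(prior):.3f}")
```

The same test result is much more informative when the disease is common in the tested population than when it is rare.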


## 4. Decision making and prediction in Bayesian Networks
Decision making and prediction in the context of Bayesian Networks involve using the network to evaluate the outcomes of different decisions under uncertainty and to make predictions about future events. This typically involves calculating expected utilities or probabilities for different actions or outcomes and selecting the best action based on these calculations.

### Equation for Decision Making and Prediction: Expected Utility

The fundamental concept in decision making under uncertainty is the Expected Utility Theory, which suggests that the best decision is the one that maximizes the expected utility. The expected utility ($EU$) of a decision $D$ can be calculated as:

$ EU(D) = \sum_{i} P(O_i|D) \cdot U(O_i) $

where:
- $P(O_i|D)$ is the probability of outcome $O_i$ given decision $D$,
- $U(O_i)$ is the utility of outcome $O_i$,
- The sum is over all possible outcomes $O_i$.

Utilities represent the decision maker's preference for different outcomes, quantifying the value or satisfaction from outcomes.

### Numerical Example: Decision Making and Prediction

Suppose you're managing a small investment portfolio and have to decide whether to invest in a new tech startup. You consider two decisions: invest ($D_1$) or not invest ($D_2$). The outcomes depend on whether the startup is successful ($S$) or not ($N$). Let's say:
- $P(S|D_1) = 0.6$, $P(N|D_1) = 0.4$ (probabilities of success and not success if you invest),
- $P(S|D_2) = 0$, $P(N|D_2) = 1$ (probabilities if you do not invest),
- $U(S) = 100$ (utility if the startup is successful),
- $U(N) = -50$ (utility if the startup is not successful; you lose some market opportunity even if you don't invest).

To make a decision, calculate the expected utility for both decisions.

### Approaches to Decision Making and Prediction

- **Expected Utility Maximization**: Directly applying the expected utility theory to make decisions that maximize the expected utility.
- **Value of Information**: Calculating the value of obtaining additional information before making a decision, to reduce uncertainty.
- **Risk Analysis and Management**: Incorporating risk preferences and using tools like sensitivity analysis to understand how uncertainties affect decisions.

Returning to the investment example above:

For the decision to invest ($D_1$), the expected utility is $0.6 \times 100 + 0.4 \times (-50) = 40$. For the decision not to invest ($D_2$), the expected utility is $0 \times 100 + 1 \times (-50) = -50$.

Given these calculations, the decision that maximizes the expected utility is to invest in the new tech startup, as it has a higher expected utility ($40$) compared to not investing ($-50$). This example illustrates how to apply expected utility theory to make decisions under uncertainty, highlighting the importance of weighing the probabilities of different outcomes against their respective utilities.

```python
# Utilities
U_S = 100
U_N = -50

# Probabilities with decision to invest
P_S_given_D1 = 0.6
P_N_given_D1 = 0.4

# Probabilities without decision to invest
P_S_given_D2 = 0
P_N_given_D2 = 1

# Calculating expected utilities
EU_D1 = P_S_given_D1 * U_S + P_N_given_D1 * U_N
EU_D2 = P_S_given_D2 * U_S + P_N_given_D2 * U_N

print((EU_D1, EU_D2))
```

Output:

```
(40.0, -50)
```

# Bayesian network (Bayes network, belief network, or decision network, Directed graphical models)

It is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).

A compact Bayesian network is a distribution in which each factor of the chain-rule factorization $p(x_1, \dotsc, x_n) = \prod_{i} p(x_i \mid x_{i-1}, \dotsc, x_1)$ depends only on a small number of ancestor variables $x_{A_i}$:

$p(x_i \mid x_{i-1}, \dotsc, x_1) = p(x_i \mid x_{A_i}).$

For example, in a model with five variables, we may choose to approximate the factor $p(x_5 \mid x_4, x_3, x_2, x_1)$ with $p(x_5 \mid x_4, x_3)$, meaning $x_{A_5} = \{x_4, x_3\}$.

### Examples


<img width="300" height="200" src='../probability/images/grade-model.png'>

$p(l, g, i, d, s) = p(l \mid g)\, p(g \mid i, d)\, p(i)\, p(d)\, p(s \mid i).$

<img src="../probability/images/win_rain_wet.jpg" alt="" />


$P(L,R,W)=P(L)P(R)P(W|R)$


<img src="../probability/images/rain_wet_car_slip.jpg" alt="" />

$P(R,W,C,S)=P(R)P(C)P(W|C,R)P(S|W)$

<img src="../probability/images/SimpleBayesNet.svg" alt="" />

Applying the chain rule gives:


$P(G,S,R)=P(G\mid S,R)P(S\mid R)P(R)$


What is the probability that it is raining, given that the grass is wet? Applying the definition of conditional probability and then marginalising over the unobserved variables:

$P(R=T\mid G=T)=\frac{P(G=T,R=T)}{P(G=T)}=\frac{\sum_{x\in \{T,F\}}P(G=T,S=x,R=T)}{\sum_{x,y\in \{T,F\}}P(G=T,S=x,R=y)}$

Now using the expansion for the joint probability function $P(G,S,R)$ and the conditional probabilities from the conditional probability tables:


$\begin{aligned}P(G=T,S=T,R=T)&=P(G=T\mid S=T,R=T)P(S=T\mid R=T)P(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}$


$P(R=T\mid G=T)=\frac{0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}=\frac{891}{2491}\approx 35.77\%.$
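The calculation above can be reproduced by enumeration. The text only quotes some of the CPT entries, so the sketch below assumes the values of the standard sprinkler example for the remaining ones (presumably those in the figure); they were chosen so that they reproduce the joint terms $0.00198$, $0.288$, $0.1584$, and $0.0$ quoted above.

```python
# CPT values assumed from the standard sprinkler example; they reproduce the
# joint terms quoted in the text (0.00198, 0.288, 0.1584, 0.0).
P_R_TRUE = 0.2                                   # P(R = T)
P_S_TRUE_GIVEN_R = {True: 0.01, False: 0.4}      # P(S = T | R)
P_G_TRUE_GIVEN_SR = {                            # P(G = T | S, R)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(g, s, r):
    """P(G, S, R) = P(G | S, R) * P(S | R) * P(R)."""
    return (
        bern(P_G_TRUE_GIVEN_SR[(s, r)], g)
        * bern(P_S_TRUE_GIVEN_R[r], s)
        * bern(P_R_TRUE, r)
    )


# Numerator: sum over S with G = T and R = T.
numerator = sum(joint(True, s, True) for s in (True, False))
# Denominator: sum over S and R with G = T.
denominator = sum(joint(True, s, r) for s in (True, False) for r in (True, False))

p_rain_given_wet = numerator / denominator
print(round(100 * p_rain_given_wet, 2))
```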


Refs: [1](https://www.youtube.com/watch?v=TuGDMj43ehw)