# Oil Field Decision Problem

This notebook walks through the decision-making process of an oil company evaluating the purchase of an oil field. We will use two frameworks: decision trees and influence diagrams.

**Decision Trees:**

* We will begin by constructing a decision tree to analyze potential outcomes and guide the company towards the most profitable choice.
* Through the decision tree, we will assess factors like oil field quality, geological test results, and potential uncertainties.
* By assigning monetary values (utilities) to each scenario, we will quantify potential profits and estimate probabilities of various outcomes.
* Ultimately, we will use these calculations to identify the decision sequence that maximizes the expected profit for the oil company.

**Influence Diagrams with [pyAgrum](https://pyagrum.readthedocs.io/en/0.16.1/):**

* Next, we will model the same problem as an influence diagram using the pyAgrum package.
    * Influence diagrams offer a more visually intuitive representation of decision problems, explicitly depicting relationships between variables.
* By comparing the decision tree and influence diagram approaches, we will gain a deeper understanding of their strengths and limitations for this specific problem.

**Future Developments: Integrating Large Language Models (LangChain):**

* As a potential future development, we will briefly discuss how Large Language Models (LLMs) could be integrated into the decision process using LangChain.
    * LLMs could automate tasks such as data analysis and information retrieval, which may make the decision-making process more efficient.

## 1 - Introduction

An oil company is considering the <span style="color: red"><b>decision</b></span> (<span style="color: red"><b>B</b></span>) to buy an oil field. The oil field can have three quality levels (<span style="color: purple"><b>Q</b></span>): high ($q_1$), medium ($q_2$), and low ($q_3$). The company does not know the true quality of the field beforehand, but it can estimate it (with some <span style="color: purple"><b>uncertainty</b></span>) using historical data and imagery. **It is willing to pay a higher price for the field as its quality increases**.

Before deciding whether to buy, the company needs to <span style="color: red"><b>decide</b></span> (<span style="color: red"><b>T</b></span>) whether to perform a geological test. The test has a cost, and its results (<span style="color: purple"><b>R</b></span>) are not exact predictions of the quality of the field; instead, they report the porosity of the reservoir (high porosity generally indicates greater oil potential). The test is not infallible, so its result carries a certain degree of <span style="color: purple"><b>uncertainty</b></span>. The test has two possible outcomes:

* **Pass:** The porosity of the reservoir rock is equal to or greater than 15%, indicating significant oil potential.
* **Fail:** The porosity of the reservoir rock is less than 15%, indicating low oil potential.

<table>
<tr>
  <td>
    <img src="./images_1/rock_porosity.jpg" alt="Rock Porosity examples" width="600">
  </td>
</tr>
<tr>
  <td><i><b>Figure 1.</b> Hydrocarbon reservoir quality in terms of permeability and porosity</i></td>
</tr>
</table>

The chronological sequence of the decision process is as follows:

1. The company decides whether or not to perform the geological test.
2. If the test is performed, the results are observed.
3. The company decides whether or not to buy the oil field.

There is still residual uncertainty in the problem that affects utility: <span style="color: purple"><b>What is the actual state of the oil field?</b></span>

In this example, it seems logical for the company to buy the oil field after obtaining a "pass" result, but this is not always the case. The best choice depends on the company's prior beliefs about the quality of the field (for example, based on its historical data on oil fields with similar characteristics), on the intrinsic uncertainty of the test (for example, the test may return "pass" although the field is not actually suitable, or vice versa), and on how the company values the possible consequences.

## 2 - Quantitative information

In this case, the quantitative information will be specified explicitly to illustrate these ideas. 

### 2.1 - The utility table ($U$)

To evaluate the decision tree for the oil field, we need to define the value (utility) of each outcome. This utility reflects the desirability of a particular scenario (buying the field after a successful test, etc.).

There are several ways to define utilities, and the best approach depends on the situation. Here's how we'll approach it:

* **Monetary Values:** We'll primarily focus on the net profit (revenue minus costs) associated with each outcome. This makes sense because the oil company is likely driven by profitability.
* **Potential Adjustments:** We might consider incorporating non-monetary factors like environmental impact in the future. For example, buying a low-quality field might have a lower environmental impact (less drilling required) compared to a high-quality one. We could then adjust the utilities to reflect this.

After several discussions, these are the resulting utilities of the problem:

<table>
  <tr>
    <th><span style="color: red">T</span></th>
    <th><span style="color: red">B</span></th>
    <th><span style="color: purple">Q</span></th>
    <th><span style="color: blue">U</span></th>
  </tr>
  <tr>
    <td rowspan="6">do</td>
    <td rowspan="3">buy</td>
    <td>high</td>
    <td>0.85</td>
  </tr>
  <tr>
    <td>medium</td>
    <td>0.43</td>
  </tr>
  <tr>
    <td>low</td>
    <td>0</td>
  </tr>
  <tr>
    <td rowspan="3">not buy</td>
    <td>high</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td>medium</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td>low</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td rowspan="6">not do</td>
    <td rowspan="3">buy</td>
    <td>high</td>
    <td>0.86</td>
  </tr>
  <tr>
    <td>medium</td>
    <td>0.44</td>
  </tr>
  <tr>
    <td>low</td>
    <td>0.01</td>
  </tr>
  <tr>
    <td rowspan="3">not buy</td>
    <td>high</td>
    <td>0.26</td>
  </tr>
  <tr>
    <td>medium</td>
    <td>0.26</td>
  </tr>
  <tr>
    <td>low</td>
    <td>0.26</td>
  </tr>
</table>
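The utility table can also be written as a small lookup structure, which is convenient for the calculations later on. The sketch below is one possible encoding; the name `UTILITY` and the label strings are illustrative choices, not part of the original notebook. It also checks that every "not do" row differs from its "do" counterpart by the same amount, which can be read as the cost of the test on the utility scale.

```python
# Utility table U(T, B, Q) copied from the table above,
# keyed by (test decision, buy decision, field quality).
# The dictionary name and label strings are illustrative assumptions.
UTILITY = {
    ("do", "buy", "high"): 0.85,
    ("do", "buy", "medium"): 0.43,
    ("do", "buy", "low"): 0.0,
    ("do", "not buy", "high"): 0.25,
    ("do", "not buy", "medium"): 0.25,
    ("do", "not buy", "low"): 0.25,
    ("not do", "buy", "high"): 0.86,
    ("not do", "buy", "medium"): 0.44,
    ("not do", "buy", "low"): 0.01,
    ("not do", "not buy", "high"): 0.26,
    ("not do", "not buy", "medium"): 0.26,
    ("not do", "not buy", "low"): 0.26,
}

# Utility gained by skipping the test, for every (B, Q) combination.
test_cost = {
    (b, q): round(UTILITY[("not do", b, q)] - UTILITY[("do", b, q)], 2)
    for (t, b, q) in UTILITY
    if t == "do"
}
print(test_cost)
```

Keeping the table in one dictionary makes it easy to look up `UTILITY[(t, b, q)]` while evaluating the tree.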

### 2.2 - The prior probability distribution of oil field quality ($Q$)

This table shows the prior probability distribution of the oil field quality, represented by the variable Q. The prior probability is the likelihood of each quality level before any observations are made. These probabilities represent the company's beliefs about the oil field quality.

<table>
  <tr>
    <th colspan="2"><span style="color: purple">Q</span></th>
  </tr>
  <tr>
    <td>high</td>
    <td>0.3</td>
  </tr>
  <tr>
    <td>medium</td>
    <td>0.4</td>
  </tr>
  <tr>
    <td>low</td>
    <td>0.3</td>
  </tr>
</table>

We can imagine these probabilities were estimated based on historical information from the oil company's past exploration of similar oil fields. For example, the company could have a classification model that predicts the oil field quality by using [satellite imagery and geographical location data](https://www.satimagingcorp.com/applications/energy/exploration/oil-exploration/).

<table>
  <tr>
  <td>
    <img src="./images_1/oil_field_image.jpg" alt="Oil field image" width="400">
  </td>
    <td>
    <img src="./images_1/oil_field_heatmap.jpg" alt="Oil field heatmap" width="400">
  </td>
  </tr>
</table>

### 2.3 - The conditional probability distribution of the porosity test result (<span style="color: purple"><b>R</b></span>)

The results of the porosity test are directly related to the actual quality of the oil field (<span style="color: purple"><b>Q</b></span>). In a perfect scenario, the test would be highly accurate:

* If the oil field is of **high** quality (<span style="color: purple"><b>Q</b></span> = high), the test result would be "pass" (<span style="color: purple"><b>R</b></span> = pass) with a probability close to 1.
* If the oil field is of **low** quality (<span style="color: purple"><b>Q</b></span> = low), the test result would be "pass" with a probability close to 0.
* If the oil field is of **medium** quality (<span style="color: purple"><b>Q</b></span> = medium), the test result could be "pass" or "fail" with an equal probability (0.5).


However, real-world tests are not perfect. The table below introduces these measurement imperfections by showing the conditional probability of each test result <span style="color: purple"><b>R</b></span> (pass or fail) given the actual quality of the oil field.

<table>
  <tr>
    <th><span style="color: purple">R</span></th>
    <th>high</th>
    <th>medium</th>
    <th>low</th>      
  </tr>
  <tr>
    <td>pass</td>
    <td>0.9</td>
    <td>0.65</td>
    <td>0.15</td>      
  </tr>
  <tr>
    <td>fail</td>
    <td>0.1</td>
    <td>0.35</td>
    <td>0.85</td> 
  </tr>
</table>
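As a quick sanity check, the prior and the conditional table can be encoded and validated in a few lines. This is a minimal sketch; the names `PRIOR`, `LIKELIHOOD`, `JOINT`, and `is_normalized` are illustrative and do not come from the original notebook. It verifies that each distribution sums to 1 and builds the joint distribution $P(Q, R) = P(R \mid Q)\,P(Q)$.

```python
# Prior P(Q) and test likelihood P(R | Q), copied from the tables above.
# All names here are illustrative assumptions.
PRIOR = {"high": 0.3, "medium": 0.4, "low": 0.3}
LIKELIHOOD = {
    "high": {"pass": 0.9, "fail": 0.1},
    "medium": {"pass": 0.65, "fail": 0.35},
    "low": {"pass": 0.15, "fail": 0.85},
}


def is_normalized(dist, tol=1e-9):
    """Return True if the probabilities in `dist` sum to 1 (within `tol`)."""
    return abs(sum(dist.values()) - 1.0) < tol


# Joint distribution P(Q, R) = P(R | Q) * P(Q).
JOINT = {
    (q, r): LIKELIHOOD[q][r] * PRIOR[q]
    for q in PRIOR
    for r in ("pass", "fail")
}

print(is_normalized(PRIOR))
print(all(is_normalized(row) for row in LIKELIHOOD.values()))
print(is_normalized(JOINT))
```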

## 3 - Modeling the problem with a decision tree

A decision tree is a visual tool that maps out different decision points and their consequences. In this case, we'll use it to model the oil field problem, considering:

* **Uncertainties**. The quality of the oil field and the result of the geological test.
* **Decisions**. The company needs to decide on two things: whether to perform the test and, ultimately, whether to buy the field based on the available information.

<table>
<tr>
  <td>
    <img src="./images_1/oil_decision_tree.png" alt="Oil decision tree" width="600">
  </td>
</tr>
<tr>
  <td><i><b>Figure 2.</b> Oil decision tree</i></td>
</tr>
</table>

It is worth noting that any structural asymmetry is explicitly reflected in the decision tree. For example, the test results will only be obtained if the company decides to perform the test.

### 3.1 - Evaluating the tree to find the optimal policy

To evaluate the decision tree we need:

* The marginal probability distribution of R.
* The conditional probabilities of Q | R (rather than R | Q), which can be computed using **Bayes' theorem**.

$$
\begin{align*}
P(R = \text{pass}) &= P(R = \text{pass} | Q = \text{high}) P(Q = \text{high}) + P(R = \text{pass} | Q = \text{medium}) P(Q=\text{medium}) + P(R = \text{pass} | Q = \text{low}) P(Q=\text{low}) \\
&= 0.9 \times 0.3 + 0.65 \times 0.4 + 0.15 \times 0.3 = 0.575 \\
P(Q = \text{high} | R = \text{pass}) &= \frac{P(R = \text{pass} | Q = \text{high}) P(Q = \text{high})}{P(R = \text{pass})} = \frac{0.9 \times 0.3}{0.575} = 0.47 \\
P(Q = \text{medium} | R = \text{pass}) &= \frac{P(R = \text{pass} | Q = \text{medium}) P(Q = \text{medium})}{P(R = \text{pass})} = \frac{0.65 \times 0.4}{0.575} = 0.452\\
P(Q = \text{low} | R = \text{pass}) &= \frac{P(R = \text{pass} | Q = \text{low}) P(Q = \text{low})}{P(R = \text{pass})} = \frac{0.15 \times 0.3}{0.575} = 0.078
\end{align*}
$$
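The same computation can be reproduced numerically. The sketch below applies the law of total probability and Bayes' theorem to the tables from Section 2; the function and variable names are illustrative assumptions rather than part of the original notebook.

```python
# Prior P(Q) and likelihood P(R | Q) from Section 2 (names are illustrative).
PRIOR = {"high": 0.3, "medium": 0.4, "low": 0.3}
LIKELIHOOD = {
    "high": {"pass": 0.9, "fail": 0.1},
    "medium": {"pass": 0.65, "fail": 0.35},
    "low": {"pass": 0.15, "fail": 0.85},
}


def marginal(result):
    """P(R = result), by the law of total probability."""
    return sum(LIKELIHOOD[q][result] * PRIOR[q] for q in PRIOR)


def posterior(result):
    """P(Q | R = result), by Bayes' theorem."""
    p_result = marginal(result)
    return {q: LIKELIHOOD[q][result] * PRIOR[q] / p_result for q in PRIOR}


print(round(marginal("pass"), 3))
print({q: round(p, 3) for q, p in posterior("pass").items()})
```

Because `posterior` takes the test result as an argument, the same function can be reused for any other value of R.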

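Analogously, we need the conditional probabilities when $R = \text{fail}$. Note that these are not the complements of the "pass" posteriors: for a fixed test result, the probabilities over $Q$ must sum to 1, so Bayes' theorem has to be applied again:

$$
\begin{align*}
P(R = \text{fail}) &= 1 - P(R = \text{pass}) = 1 - 0.575 = 0.425\\
P(Q = \text{high} | R = \text{fail}) &= \frac{P(R = \text{fail} | Q = \text{high}) P(Q = \text{high})}{P(R = \text{fail})} = \frac{0.1 \times 0.3}{0.425} = 0.071\\
P(Q = \text{medium} | R = \text{fail}) &= \frac{P(R = \text{fail} | Q = \text{medium}) P(Q = \text{medium})}{P(R = \text{fail})} = \frac{0.35 \times 0.4}{0.425} = 0.329\\
P(Q = \text{low} | R = \text{fail}) &= \frac{P(R = \text{fail} | Q = \text{low}) P(Q = \text{low})}{P(R = \text{fail})} = \frac{0.85 \times 0.3}{0.425} = 0.600
\end{align*}
$$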

Once the probabilities have been computed, we calculate the expected utility at each chance node and use it to resolve the preceding decision node. This is done by:

* Multiplying the payoff of each outcome by its probability.
* Summing these products for all possible outcomes under that decision.

The decision with the highest expected utility is considered the optimal choice. For example, the expected utility of buying the oil field after performing the test and obtaining a "pass" result is:

$$
0.470 \times 0.85 + 0.452 \times 0.43 + 0.078 \times 0 = 0.594
$$

Moving backwards to decision node B, since $0.594 > 0.25$, buying is better than not buying in that situation. We then repeat this process backwards through the tree until we reach the root node, which yields the optimal policy (i.e., the one with the highest expected utility).
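The step at decision node B can be sketched in a few lines. The example below uses the rounded posterior for $R = \text{pass}$ and the utilities for the "do the test" rows; the helper name `expected_utility` and the variable names are illustrative assumptions.

```python
# Posterior P(Q | R = pass), rounded as in the text, and utilities for T = do.
# All names are illustrative assumptions.
POSTERIOR_PASS = {"high": 0.470, "medium": 0.452, "low": 0.078}
UTILITY_BUY = {"high": 0.85, "medium": 0.43, "low": 0.0}
UTILITY_NOT_BUY = 0.25  # identical for every quality level


def expected_utility(probabilities, utilities):
    """Probability-weighted sum of utilities over the quality levels."""
    return sum(probabilities[q] * utilities[q] for q in probabilities)


eu_buy = expected_utility(POSTERIOR_PASS, UTILITY_BUY)

# At a decision node, pick the alternative with the higher expected utility.
best_action = "buy" if eu_buy > UTILITY_NOT_BUY else "not buy"
print(round(eu_buy, 3), best_action)
```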

Figure 3 shows the evaluated decision tree with payoffs and expected utilities. It indicates that doing the test is the optimal first step. Based on the test results:

* If the test is successful ("pass"), buying the oil field has the highest expected value.
* If the test fails, not buying the oil field is the better option.


<table>
<tr>
  <td>
    <img src="./images_1/evaluated_oil_decision_tree.png" alt="Evaluated oil decision tree" width="600">
  </td>
</tr>
<tr>
  <td><i><b>Figure 3.</b> Evaluated oil decision tree</i></td>
</tr>
</table>

This evaluation helps the oil company make an informed decision by accounting for the uncertainties involved and maximizing its expected utility.
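The whole backward evaluation can be reproduced with a short script. The following is a minimal sketch of the rollback procedure using the tables from Section 2. All function and variable names are illustrative assumptions, and the 0.01 offset between the "do" and "not do" rows is read directly from the utility table.

```python
# Inputs from Section 2 (names are illustrative assumptions).
PRIOR = {"high": 0.3, "medium": 0.4, "low": 0.3}
LIKELIHOOD = {
    "high": {"pass": 0.9, "fail": 0.1},
    "medium": {"pass": 0.65, "fail": 0.35},
    "low": {"pass": 0.15, "fail": 0.85},
}
BUY = {"high": 0.85, "medium": 0.43, "low": 0.0}  # rows with T = do, B = buy
NOT_BUY = 0.25                                    # rows with T = do, B = not buy
TEST_COST = 0.01  # "not do" rows are 0.01 higher than "do" rows in the table


def utility(test, buy, quality):
    """U(T, B, Q) reconstructed from the utility table."""
    base = BUY[quality] if buy == "buy" else NOT_BUY
    return base + (TEST_COST if test == "not do" else 0.0)


def best_buy_decision(test, beliefs):
    """Return (action, expected utility) that is optimal under `beliefs` over Q."""
    options = {
        b: sum(beliefs[q] * utility(test, b, q) for q in beliefs)
        for b in ("buy", "not buy")
    }
    action = max(options, key=options.get)
    return action, options[action]


def evaluate():
    """Roll back the tree: returns (no-test result, test policy, test EU)."""
    # Branch without the test: decide using the prior.
    no_test = best_buy_decision("not do", PRIOR)
    # Branch with the test: decide using the posterior for each result,
    # then average the optimal values over the test results.
    policy, eu_test = {}, 0.0
    for r in ("pass", "fail"):
        p_r = sum(LIKELIHOOD[q][r] * PRIOR[q] for q in PRIOR)
        post = {q: LIKELIHOOD[q][r] * PRIOR[q] / p_r for q in PRIOR}
        action, eu = best_buy_decision("do", post)
        policy[r] = action
        eu_test += p_r * eu
    return no_test, policy, eu_test


no_test, policy, eu_test = evaluate()
print("without test:", no_test[0], round(no_test[1], 3))
print("with test:", policy, round(eu_test, 3))
```

With these numbers the test branch comes out slightly ahead (about 0.448 versus 0.437 for buying without a test), which agrees with the policy described above.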

## 4 - Influence diagrams

### 4.1 - Limitations of decision trees

Decision trees are powerful tools for visualizing and analyzing decision-making problems. However, they have some limitations:

* **Complexity:** Large decision trees can become cumbersome and difficult to interpret, especially with many variables and outcomes.
* **Limited Dependence Modeling:** Decision trees struggle to represent complex relationships between variables. Each branch in the tree depicts a single chain of events, neglecting potential interactions or dependencies.
* **Intractability:** As the number of variables increases, the number of possible branches in the tree explodes exponentially, making it computationally expensive and unwieldy.
* **Limited Incorporation of New Information:** Updating a decision tree often requires significant changes if new information or variables become available.
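To give a feeling for the growth mentioned under intractability, the sketch below counts the leaves of a fully symmetric tree, where every variable appears once on each path. The function name is illustrative, and real decision trees are often asymmetric and therefore somewhat smaller than this bound.

```python
from math import prod


def symmetric_tree_leaves(cardinalities):
    """Leaves of a symmetric tree with one level per variable.

    `cardinalities` lists how many values (branches) each variable has.
    """
    return prod(cardinalities)


# Growth with n binary variables: the leaf count doubles with every variable.
for n in (5, 10, 20):
    print(n, symmetric_tree_leaves([2] * n))
```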


### 4.2 - Advantages of influence diagrams

Influence diagrams address some of the shortcomings of decision trees:

* **Clarity and Transparency:** Influence diagrams use a graphical format with nodes and arrows to represent variables and their relationships. This visual approach is often easier to understand than complex decision trees.
* **Explicit Dependence Modeling:** Influence diagrams can explicitly show how variables influence each other. Arrows depict these dependencies, allowing for a more comprehensive understanding of the problem structure.
* **Scalability:** Influence diagrams are generally more scalable than decision trees. They can handle complex problems with numerous variables while maintaining clarity.
* **Flexibility:** Updating an influence diagram is often easier than modifying a decision tree. New information can be incorporated by adding or modifying nodes and arrows.

### 4.3 - Formal definition

An **influence diagram** (ID), also known as a relevance diagram, decision diagram, or decision network, is a directed acyclic graph (DAG) represented as $G = (N, E)$, where:

**Nodes:**

The set of nodes $N$ is partitioned into three subsets:

* **Decision nodes** ($D$): Represented by squares, these nodes represent points in the model where a decision-maker can choose between alternative courses of action.
* **Chance nodes** ($C$): Represented by circles, these nodes represent uncertain events or states of nature that can influence the outcome of the decision.
* **Value nodes** ($V$): Represented by diamonds, these nodes represent the utility (payoff) that results from the decisions taken and from the outcomes of the chance nodes they depend on.

**Edges:**

The set of edges $E$ includes two types of arcs, depending on the type of node they point to:

* **Informational Arcs**. These arcs connect to decision nodes and indicate temporal precedence. In other words, the variable at the origin of the arc represents information that is available and known at the time the decision is made at the destination node.
* **Conditional Arcs**. These arcs connect to value or chance nodes and represent dependencies, either functional or probabilistic, on the values of the parent nodes. They do not imply causality or temporal precedence.
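The definition can be made concrete with a plain-Python sketch that does not depend on any library beyond the standard one. The arc set below is an assumption based on the problem description (for instance, the T → B arc encodes that the first decision is remembered when the second is made); it is not taken from the original notebook. The code classifies each arc by the type of its destination node and checks that the graph is acyclic.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Node partition N = D ∪ C ∪ V for the oil field problem.
NODE_TYPES = {
    "T": "decision",  # perform the test?
    "B": "decision",  # buy the field?
    "Q": "chance",    # field quality
    "R": "chance",    # test result
    "U": "value",     # utility
}

# Assumed arc set E (source, target); an illustration, not the author's model.
ARCS = [
    ("T", "R"), ("Q", "R"),              # R depends on the test and on quality
    ("T", "B"), ("R", "B"),              # known when the buy decision is made
    ("T", "U"), ("B", "U"), ("Q", "U"),  # U is a function of T, B and Q
]


def arc_kind(arc):
    """Arcs into decision nodes are informational; all others are conditional."""
    _, target = arc
    return "informational" if NODE_TYPES[target] == "decision" else "conditional"


def topological_order(arcs):
    """Return a topological order; raises graphlib.CycleError if not a DAG."""
    sorter = TopologicalSorter()
    for source, target in arcs:
        sorter.add(target, source)  # `target` comes after `source`
    return list(sorter.static_order())


for arc in ARCS:
    print(arc, arc_kind(arc))
print(topological_order(ARCS))
```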

## 5 - Modeling the decision problem with an influence diagram

Modeling takes place at several levels. As we did with the decision tree, drawing the graph represents the qualitative description of the problem (the structural or graphical level). We then add the quantitative information (the numerical level) so that the diagram is fully specified.



In [1]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
from pyAgrum import InfluenceDiagram



In [2]:
influence_diagram = InfluenceDiagram()
influence_diagram.addChanceNode("Q")
gnb.sideBySide(influence_diagram)
