# Backpropagation Algorithm

### Recall computation graphs
Recall that a computation graph is a directed acyclic graph (DAG) where:
1.  Nodes in the graph represent variables (inputs, parameters, intermediate values, the final output $L$).
2.  Edges represent functions/operations that produce an output node from input node(s). For an operation `y = f(u, v)`, there are incoming edges from nodes `u` and `v` to node `y`.
3.  Each variable/node `u` in the graph has two associated values:
    *   `u.data`: The numerical value computed during the forward pass.
    *   `u.grad`: Stores the computed partial derivative $\frac{\partial L}{\partial u}$. Initialized to zero (one for the output node `L`) before the backward pass.

### The Backpropagation Algorithm (Reverse-Mode Automatic Differentiation)

**Goal:** Given a computation graph representing a scalar function `L` (e.g., the loss) that depends on some input parameters (e.g., `w1, w2, ..., wn`), efficiently compute the partial derivatives (gradients) of `L` with respect to each parameter.

**Algorithm Steps:** (Note: below, `dy/dx` denotes the partial derivative of `y` with respect to `x`.)

**1. Forward Pass:**

*   Evaluate the graph from inputs to the final output `L`.
*   For each node `u`, compute and store its value `u.data` based on the values of its predecessors and the operation performed.
    *   Example: If `u = v + w`, then `u.data = v.data + w.data`.
*   This pass computes the final value `L.data`. We need the intermediate `data` values stored at each node for the backward pass.

**2. Backward Pass Initialization:**

*   Initialize the `.grad` attribute for all nodes in the graph to zero.
    ```python
    # Conceptual initialization (graph: any iterable over all nodes)
    for u in graph:
        u.grad = 0.0
    ```
*   Set the gradient of the final output node `L` with respect to itself to one.
    ```python
    L.grad = 1.0  # Represents dL/dL = 1
    ```

**3. Backward Pass Iteration:**

*   Process the nodes in **reverse topological order** (starting from the final node `L` and moving backward towards the inputs/parameters). This ensures that when we process a node `u`, the gradient `u.grad` (representing `dL/du`) has already been fully computed by accumulating contributions from all paths downstream from `u`.
*   For each node `u` being processed:
    *   We have the accumulated gradient `dL/du` stored in `u.grad`.
    *   Consider all nodes `v` that were **inputs** to the operation that produced `u`. Let the operation be `u = f(v_1, v_2, ..., v_k)`.
    *   For each input node `v_i` to the operation `f`:
        *   Compute the **local derivative** of the operation's output (`u`) with respect to this specific input (`v_i`), evaluated at the `.data` values computed during the forward pass. Let's call this `du/dv_i`.
            *   Example: If `u = v_1 * v_2`, then `du/dv_1 = v_2.data` and `du/dv_2 = v_1.data`.
            *   Example: If `u = sin(v_1)`, then `du/dv_1 = cos(v_1.data)`.
        *   Apply the chain rule to find the contribution of `u.grad` to the gradient of `v_i` flowing *through* `u`:
            ```python
            contribution = u.grad * du_dv_i  # (dL/du) * (du/dv_i)
            ```
        *   **Accumulate** this contribution into the gradient of the input node `v_i`:
            ```python
            v_i.grad += contribution
            # This performs: dL/dv_i = (dL/dv_i)_old + (dL/du) * (du/dv_i)
            ```
            The `+=` operation is crucial. It ensures that if `v_i` is an input to multiple operations (i.e., it has multiple outgoing edges in the forward graph), its total gradient `dL/dv_i` correctly sums the contributions from all paths flowing back to it.
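To see why accumulation matters, here is a minimal sketch in plain Python floats. The example `L = x * x` is chosen for illustration and is not part of the algorithm description above; the variable names are likewise hypothetical. The node `x` feeds the multiplication through both input slots, so it receives two contributions.

```python
# Illustrative example: L = u = x * x, so x is used twice by one operation.
x_data = 3.0
u_data = x_data * x_data          # forward pass

u_grad = 1.0                      # dL/dL, since L = u
x_grad = 0.0                      # initialized to zero

# Local derivative w.r.t. each input slot is the value in the other slot.
x_grad += u_grad * x_data         # contribution through the first slot
x_grad += u_grad * x_data         # contribution through the second slot

# For contrast: plain assignment keeps only the last contribution.
x_grad_overwritten = u_grad * x_data
x_grad_overwritten = u_grad * x_data

print(x_grad)              # 6.0, matches d(x^2)/dx = 2x at x = 3
print(x_grad_overwritten)  # 3.0, half the correct gradient
```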

**4. Final Result:**

*   After processing all nodes in reverse topological order, the `.grad` attribute of each input parameter node `w_i` will contain the desired total partial derivative `dL/dw_i`.

**Why it Works (Connection to Differentials):**

This algorithm systematically computes the coefficients in the total differential `dL`. Each update `v_i.grad += u.grad * (du/dv_i)` amounts to substituting the expression for `du` (written in terms of the `dv_i`) into `dL`. The final `w_i.grad` values are the coefficients of `dw_i` in the full expression for `dL` written solely in terms of the input parameter differentials.
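As a small worked instance of this substitution (an illustrative case, assuming $L$ depends on $v_1$ and $v_2$ only through the single product node $u = v_1 v_2$):

$$
dL = \frac{\partial L}{\partial u}\,du
   = \frac{\partial L}{\partial u}\,\bigl(v_2\,dv_1 + v_1\,dv_2\bigr)
   = \Bigl(\frac{\partial L}{\partial u}\,v_2\Bigr)\,dv_1 + \Bigl(\frac{\partial L}{\partial u}\,v_1\Bigr)\,dv_2
$$

The two bracketed coefficients are `u.grad * v_2.data` and `u.grad * v_1.data`, which are the contributions accumulated into `v_1.grad` and `v_2.grad`. If `v_1` or `v_2` also fed other nodes, those paths would add further terms to the same coefficients.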

### Numerical Example: `L = log(x1*x2) + sin(x1*x3) + x2*x3`

Let's compute the gradients `dL/dx1`, `dL/dx2`, and `dL/dx3` using backpropagation.

**Inputs:**
- `x1 = 2.0`
- `x2 = 3.0`
- `x3 = 4.0`

We assume `log` is the natural logarithm and `sin`/`cos` use radians.

**1. Define Computation Graph Nodes & Operations:**

*   `x1`, `x2`, `x3` (Inputs/Parameters)
*   `a = x1 * x2`
*   `b = log(a)`
*   `c = x1 * x3`
*   `d = sin(c)`
*   `e = x2 * x3`
*   `f = b + d`
*   `L = f + e` (Final Output)

**2. Forward Pass:** Calculate `.data` for each node.

*   `x1.data = 2.0`
*   `x2.data = 3.0`
*   `x3.data = 4.0`
*   `a.data = 2.0 * 3.0 = 6.0`
*   `b.data = log(6.0) ≈ 1.7918`
*   `c.data = 2.0 * 4.0 = 8.0`
*   `d.data = sin(8.0) ≈ 0.9894`
*   `e.data = 3.0 * 4.0 = 12.0`
*   `f.data = b.data + d.data ≈ 2.7811` (from the unrounded values; adding the rounded 1.7918 and 0.9894 gives 2.7812)
*   `L.data = f.data + e.data ≈ 14.7811`
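The forward pass above can be reproduced with a few lines of plain Python. This is only a sketch using the standard `math` module and ordinary floats (no graph structure yet); the values are computed without rounding intermediates, so the last printed digit may differ slightly from arithmetic done by hand on rounded numbers.

```python
import math

# Inputs
x1, x2, x3 = 2.0, 3.0, 4.0

# Forward pass, one line per node of the graph
a = x1 * x2        # 6.0
b = math.log(a)    # natural logarithm
c = x1 * x3        # 8.0
d = math.sin(c)    # radians
e = x2 * x3        # 12.0
f = b + d
L = f + e

for name, value in [("a", a), ("b", b), ("c", c), ("d", d),
                    ("e", e), ("f", f), ("L", L)]:
    print(f"{name}.data = {value:.4f}")
```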

**3. Backward Pass Initialization:**

*   Initialize `.grad = 0.0` for nodes `x1, x2, x3, a, b, c, d, e, f`.
*   Set `L.grad = 1.0`.

**4. Backward Pass Iteration (Reverse Topological Order):**

Processing order: `L, f, e, d, b, c, a, x3, x2, x1`
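Any order in which each node comes before all of its inputs is a valid processing order; the one listed here is just one choice. As a sketch, the standard-library `graphlib.TopologicalSorter` (Python 3.9+) can produce such an order from a dictionary of dependencies. The dictionary `inputs` and the helper `is_valid_backward_order` are hypothetical names introduced for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each non-input node to the inputs of the operation that produced it.
inputs = {
    "a": ["x1", "x2"],
    "b": ["a"],
    "c": ["x1", "x3"],
    "d": ["c"],
    "e": ["x2", "x3"],
    "f": ["b", "d"],
    "L": ["f", "e"],
}

# static_order() yields every node after its predecessors (inputs first),
# so reversing it gives an order suitable for the backward pass.
forward_order = list(TopologicalSorter(inputs).static_order())
backward_order = forward_order[::-1]

def is_valid_backward_order(order, inputs):
    """True if every node appears before all of its inputs."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[node] < position[inp]
               for node, node_inputs in inputs.items()
               for inp in node_inputs)

print(backward_order)
print(is_valid_backward_order(backward_order, inputs))
```

The specific order printed may differ from the one used in this walkthrough, since a graph usually has several valid topological orders.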

*   **Node L:** (`L = f + e`)
    *   Gradient `dL/dL` is `L.grad = 1.0`.
    *   Inputs: `f`, `e`. 
    *   Local derivatives: `dL/df = 1`, `dL/de = 1`.
    *   Propagate:
        *   `f.grad += L.grad * (dL/df) = 0.0 + 1.0 * 1 = 1.0`
        *   `e.grad += L.grad * (dL/de) = 0.0 + 1.0 * 1 = 1.0`
    *   *Current Gradients*: 
        - `f.grad=1.0` 
        - `e.grad=1.0`

*   **Node f:** (`f = b + d`)
    *   Gradient `dL/df` is `f.grad = 1.0`.
    *   Inputs: `b`, `d`. 
    *   Local derivatives: `df/db = 1`, `df/dd = 1`.
    *   Propagate:
        *   `b.grad += f.grad * (df/db) = 0.0 + 1.0 * 1 = 1.0`
        *   `d.grad += f.grad * (df/dd) = 0.0 + 1.0 * 1 = 1.0`
    *   *Current Gradients*: 
        - `b.grad=1.0` 
        - `d.grad=1.0`
        - `e.grad=1.0`

*   **Node e:** (`e = x2 * x3`)
    *   Gradient `dL/de` is `e.grad = 1.0`.
    *   Inputs: `x2`, `x3`. 
    *   Local derivatives: `de/dx2 = x3.data = 4.0`, `de/dx3 = x2.data = 3.0`.
    *   Propagate:
        *   `x2.grad += e.grad * (de/dx2) = 0.0 + 1.0 * 4.0 = 4.0`
        *   `x3.grad += e.grad * (de/dx3) = 0.0 + 1.0 * 3.0 = 3.0`
    *   *Current Gradients*: 
        - `b.grad=1.0` 
        - `d.grad=1.0` 
        - `x2.grad=4.0` 
        - `x3.grad=3.0`

*   **Node d:** (`d = sin(c)`)
    *   Gradient `dL/dd` is `d.grad = 1.0`.
    *   Input: `c`. 
    *   Local derivative: `dd/dc = cos(c.data) = cos(8.0) ≈ -0.1455`.
    *   Propagate:
        *   `c.grad += d.grad * (dd/dc) = 0.0 + 1.0 * (-0.1455) = -0.1455`
    *   *Current Gradients*: 
        - `b.grad=1.0` 
        - `c.grad=-0.1455` 
        - `x2.grad=4.0` 
        - `x3.grad=3.0`

*   **Node b:** (`b = log(a)`)
    *   Gradient `dL/db` is `b.grad = 1.0`.
    *   Input: `a`. 
    *   Local derivative: `db/da = 1 / a.data = 1 / 6.0 ≈ 0.1667`.
    *   Propagate:
        *   `a.grad += b.grad * (db/da) = 0.0 + 1.0 * (0.1667) = 0.1667`
    *   *Current Gradients*: 
        - `a.grad=0.1667`
        - `c.grad=-0.1455` 
        - `x2.grad=4.0` 
        - `x3.grad=3.0`

*   **Node c:** (`c = x1 * x3`)
    *   Gradient `dL/dc` is `c.grad ≈ -0.1455`.
    *   Inputs: `x1`, `x3`. 
    *   Local derivatives: `dc/dx1 = x3.data = 4.0`, `dc/dx3 = x1.data = 2.0`.
    *   Propagate:
        *   `x1.grad += c.grad * (dc/dx1) = 0.0 + (-0.1455) * 4.0 ≈ -0.5820`
        *   `x3.grad += c.grad * (dc/dx3) = 3.0 + (-0.1455) * 2.0 = 2.7090` (Accumulation!)
    *   *Current Gradients*: 
        - `a.grad=0.1667` 
        - `x1.grad=-0.5820` 
        - `x2.grad=4.0` 
        - `x3.grad=2.7090`

*   **Node a:** (`a = x1 * x2`)
    *   Gradient `dL/da` is `a.grad ≈ 0.1667`.
    *   Inputs: `x1`, `x2`. 
    *   Local derivatives: `da/dx1 = x2.data = 3.0`, `da/dx2 = x1.data = 2.0`.
    *   Propagate:
        *   `x1.grad += a.grad * (da/dx1) = -0.5820 + (1/6) * 3.0 ≈ -0.0820` (Accumulation! Using the unrounded `a.grad = 1/6`.)
        *   `x2.grad += a.grad * (da/dx2) = 4.0 + (1/6) * 2.0 ≈ 4.3333` (Accumulation!)
    *   *Current Gradients*: 
        - `x1.grad=-0.0820` 
        - `x2.grad=4.3333` 
        - `x3.grad=2.7090`

*   **Nodes x3, x2, x1:** These are input nodes. All incoming gradient paths have been processed.
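The node-by-node updates above can be replayed in plain Python as a sanity check. This is a hand-unrolled sketch, not a general implementation: the dictionary `grad` is a hypothetical stand-in for the `.grad` attributes, and intermediates are kept at full precision, so the last digit may differ from hand arithmetic done on rounded values.

```python
import math

# Forward pass values (needed for the local derivatives)
x1, x2, x3 = 2.0, 3.0, 4.0
a = x1 * x2
c = x1 * x3

# Initialization: all gradients zero, except dL/dL = 1
grad = dict.fromkeys(["x1", "x2", "x3", "a", "b", "c", "d", "e", "f", "L"], 0.0)
grad["L"] = 1.0

# Reverse topological order: L, f, e, d, b, c, a
grad["f"] += grad["L"] * 1.0            # L = f + e
grad["e"] += grad["L"] * 1.0
grad["b"] += grad["f"] * 1.0            # f = b + d
grad["d"] += grad["f"] * 1.0
grad["x2"] += grad["e"] * x3            # e = x2 * x3
grad["x3"] += grad["e"] * x2
grad["c"] += grad["d"] * math.cos(c)    # d = sin(c)
grad["a"] += grad["b"] * (1.0 / a)      # b = log(a)
grad["x1"] += grad["c"] * x3            # c = x1 * x3
grad["x3"] += grad["c"] * x1            # second contribution to x3
grad["x1"] += grad["a"] * x2            # a = x1 * x2, second contribution to x1
grad["x2"] += grad["a"] * x1            # second contribution to x2

for name in ("x1", "x2", "x3"):
    print(f"dL/d{name} = {grad[name]:.4f}")
```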

**5. Final Result:**

The computed gradients are stored in the `.grad` attributes of the input nodes:

*   `dL/dx1 = x1.grad ≈ -0.0820`
*   `dL/dx2 = x2.grad ≈ 4.3333`
*   `dL/dx3 = x3.grad ≈ 2.7090`
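One way to gain confidence in these numbers is to compare them against closed-form derivatives and a central finite-difference estimate. The sketch below uses only the standard `math` module; the helper `numerical_grad` and the step size `h` are assumptions made for this check, not part of the algorithm.

```python
import math

def loss(x1, x2, x3):
    """L = log(x1*x2) + sin(x1*x3) + x2*x3 as a plain Python function."""
    return math.log(x1 * x2) + math.sin(x1 * x3) + x2 * x3

def numerical_grad(fn, point, h=1e-6):
    """Central-difference estimate of each partial derivative of fn at point."""
    grads = []
    for i in range(len(point)):
        plus, minus = list(point), list(point)
        plus[i] += h
        minus[i] -= h
        grads.append((fn(*plus) - fn(*minus)) / (2 * h))
    return grads

point = (2.0, 3.0, 4.0)
x1, x2, x3 = point

# Closed-form partial derivatives, obtained by differentiating L by hand
analytic = [
    1.0 / x1 + x3 * math.cos(x1 * x3),   # dL/dx1
    1.0 / x2 + x3,                       # dL/dx2
    x1 * math.cos(x1 * x3) + x2,         # dL/dx3
]
numeric = numerical_grad(loss, point)

for name, exact, approx in zip(("x1", "x2", "x3"), analytic, numeric):
    print(f"dL/d{name}: analytic {exact:.4f}, finite difference {approx:.4f}")
```

Finite differences need two function evaluations per input, whereas the backward pass yields all partial derivatives in a single sweep; the comparison is useful mainly as a test.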

### Implementing Backpropagation in Python

There is a natural recursive structure to the backward pass that makes for a fun implementation problem. To illustrate the concept without getting bogged down in the details (especially the complications introduced by using tensors), let's implement a version of Karpathy's `micrograd` library.

#### High-Level Structure of micrograd

The `micrograd` library is designed for automatic differentiation, specifically using the reverse-mode algorithm (backpropagation). Its core structure revolves around a single primary class:

1.  **The `Value` Class:**
    *   This is the central data structure in `micrograd`.
    *   Each `Value` object represents a single **scalar node** within a computation graph. It could be an input variable, a parameter, an intermediate result, or the final output (like a loss function).
    *   **Key Attributes:**
        *   `data`: Stores the actual numerical floating-point value computed during the forward pass.
        *   `grad`: Stores the gradient of the *final* scalar output of the computation graph with respect to this `Value` object's `data`. It's initialized to `0.0` and gets accumulated during the backward pass.
        *   `_prev`: A set containing the parent `Value` objects (the inputs) that were used in the operation that created *this* `Value` object. This links nodes together to form the graph structure, pointing backward from children to parents.
        *   `_op`: A string indicating the mathematical operation (e.g., `'+'`, `'*'`, `'tanh'`) that produced this `Value` object from its parents in `_prev`. Useful for visualization and debugging.
        *   `_backward`: An internal *function* specific to this `Value` object. This function encapsulates the *local* chain rule logic for the operation (`_op`) that created this node. It knows how to take the gradient accumulated in *this* node (`self.grad`) and distribute it back to its parent nodes (those in `_prev`).
        *   `label` (optional): A string name for the node, helpful for readability.

2.  **Operations (Methods on `Value`):**
    *   Mathematical operations like addition (`+`, `__add__`), multiplication (`*`, `__mul__`), exponentiation (`**`, `__pow__`), `tanh()`, `relu()`, `exp()`, `log()`, `sin()`, etc., are implemented as methods or by overloading standard Python operators for the `Value` class.
    *   When an operation is performed (e.g., `c = a + b`), it:
        *   Creates a *new* `Value` object (`c`).
        *   Calculates `c.data`.
        *   Sets `c._prev = {a, b}` and `c._op = '+'`.
        *   Crucially, it defines and assigns the appropriate `_backward` function to `c`.

3.  **Backpropagation Execution (`backward()` method):**
    *   This is a *public* method called on the *final* `Value` object of the graph (typically the scalar loss, `L`).
    *   Its role is to orchestrate the entire backward pass over the *whole graph* leading to that node.
    *   **Steps:**
        1.  **Topological Sort:** It first performs a topological sort of the computation graph, starting from the final node and following the `_prev` links backward to collect every node that contributes to it, ordered so that each node appears after all of its inputs.
        2.  **Initialize Gradient:** It sets the gradient of the final node (`self.grad`) to `1.0` (since `dL/dL = 1`).
        3.  **Iterate and Propagate:** It iterates through the topologically sorted nodes in *reverse* order. For each node in this sequence, it calls that node's internal `_backward()` function.

4.  **Contrast: `backward()` vs. `_backward()`**
    *   `backward()`: The **orchestrator**. Called *once* on the final node. Manages the *global* process of backpropagation across the *entire* graph (performs topological sort, initializes `L.grad=1`, iterates).
    *   `_backward()`: The **local worker**. An *internal* function, *specific* to each node, defined by the operation that created it. It implements the *local* chain rule for *one specific operation*. It takes the gradient already computed for the *output* node (`self.grad`) and uses the local derivatives of the operation to calculate and *add* the gradient contributions to the `grad` attributes of the *input* nodes (`parent.grad += ...`). It is *called* repeatedly by the main `backward()` method during its iteration.

**Remark: Implicit Backward Pass for Composed Operations**

You might notice below that operations like division (`__truediv__`), subtraction (`__sub__`), or negation (`__neg__`) don't explicitly define their own `_backward` function within their implementation.

This is because these operations are implemented by **composing** more primitive operations that *do* have `_backward` defined.

For example, division `a / b` is implemented as `a * (b**-1)`:

1.  `b**-1`: Creates an intermediate `Value` node using the `__pow__` operation. This node gets the `_backward` logic associated with exponentiation.
2.  `a * (intermediate_node)`: Creates the final `Value` node using the `__mul__` operation. This node gets the `_backward` logic associated with multiplication.

When `backward()` is called on the result of the division, the topological sort includes these intermediate nodes. The main `backward()` loop then calls the respective `_backward` functions for `__mul__` and `__pow__` on the relevant nodes in the correct order. The chain rule is thus applied correctly through these constituent steps, automatically handling the gradient propagation for the composite division operation without needing an explicit `_backward` function defined within `__truediv__` itself. This composition simplifies the implementation, as complex derivatives are built up from simpler ones.
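The two-step propagation for division can be traced by hand with plain floats. This is a sketch of the arithmetic only, with hypothetical variable names (`t` for the intermediate `b**-1` node); it does not use the `Value` class.

```python
a, b = 2.0, 5.0

# Forward pass: y = a / b, computed as a * (b ** -1)
t = b ** -1          # intermediate node produced by the power operation
y = a * t            # final node produced by the multiplication

# Backward pass, in reverse order of creation
y_grad = 1.0                              # dy/dy
a_grad = y_grad * t                       # multiplication: dy/da = t
t_grad = y_grad * a                       # multiplication: dy/dt = a
b_grad = t_grad * (-1 * b ** (-1 - 1))    # power rule: dt/db = n * b**(n-1), n = -1

print(a_grad)   # dy/da, equals 1/b
print(b_grad)   # dy/db, equals -a / b**2
```

Chaining the multiplication rule and the power rule reproduces the familiar quotient-rule result without ever writing that rule down explicitly.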

In [None]:
import math

class Value:
    """
    Stores a single scalar value and its gradient. Enables automatic differentiation
    by tracking the operations that created it (building a computation graph)
    and implementing the chain rule for gradient propagation during the backward pass.

    Attributes:
        data (float): The numerical value stored in this node.
        grad (float): The gradient of the final scalar output (often the loss L)
                      with respect to this node's value. Initialized to 0.
        _backward (callable): A function that propagates this node's gradient
                              to its inputs (the nodes in _prev) using the local
                              chain rule. It is defined by the operation that
                              created this node.
        _prev (set): A set containing the 'Value' objects that were inputs to
                     the operation that created this 'Value' object (its parents
                     in the computation graph).
        _op (str): A string representation of the operation that created this
                   node (e.g., '+', '*', 'tanh'). Useful for debugging/visualization.
        label (str): An optional label for the node, helpful for diagrams.
    """

    def __init__(self, data, _children=(), _op='', label=''):
        """
        Initializes a Value object.

        Args:
            data (float or int): The numerical data for this value.
            _children (tuple): Internal argument. Tuple of parent Value objects
                               that produced this Value.
            _op (str): Internal argument. The operation that produced this Value.
            label (str): Optional descriptive label for this value.
        """
        self.data = float(data)
        self.grad = 0.0  # Initialize gradient to zero

        # Internal variables used for building the computation graph & backprop
        self._backward = lambda: None  # Default: leaf nodes have no backward function
        self._prev = set(_children)    # Set of input Value objects (parents)
        self._op = _op                 # Operation that produced this node
        self.label = label

    def __repr__(self):
        """Provides a readable representation of the Value object."""
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other):
        """
        Overloads the '+' operator for Value objects.

        Handles addition with another Value object or a constant (int/float).
        """
        other = other if isinstance(other, Value) else Value(other) # Wrap constants
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # Gradient of '+' is 1 for both inputs. Apply chain rule:
            # dL/dself = dL/dout * dout/dself = dL/dout * 1
            # dL/dother = dL/dout * dout/dother = dL/dout * 1
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward

        return out

    def __mul__(self, other):
        """
        Overloads the '*' operator for Value objects.

        Handles multiplication with another Value object or a constant (int/float).
        """
        other = other if isinstance(other, Value) else Value(other) # Wrap constants
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # Gradient of '*' uses the product rule (applied via chain rule):
            # dL/dself = dL/dout * dout/dself = dL/dout * other.data
            # dL/dother = dL/dout * dout/dother = dL/dout * self.data
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward

        return out

    def __pow__(self, other):
        """
        Overloads the '**' operator for Value objects (raising self to a power).

        Args:
            other (int or float): The exponent (must be a constant).
        """
        assert isinstance(other, (int, float)), "Exponent must be int/float for this simple version"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            # Gradient rule: d(x^n)/dx = n * x^(n-1)
            # dL/dself = dL/dout * dout/dself = dL/dout * (other * self.data**(other - 1))
            self.grad += out.grad * (other * self.data**(other - 1))
        out._backward = _backward

        return out

    def __truediv__(self, other):
        """
        Overloads the '/' operator (true division). Implemented as self * (other**-1).
        """
        return self * (other**-1)

    def __neg__(self):
        """Overloads the unary '-' operator (negation). Implemented as self * -1."""
        return self * -1

    def __sub__(self, other):
        """Overloads the binary '-' operator (subtraction). Implemented as self + (-other)."""
        return self + (-other)

    # --- Reflected operators for handling constants on the left ---

    def __radd__(self, other): # other + self
        """Handles addition when a constant is on the left (e.g., 2 + Value(3))."""
        return self + other

    def __rmul__(self, other): # other * self
        """Handles multiplication when a constant is on the left (e.g., 2 * Value(3))."""
        return self * other

    def __rsub__(self, other): # other - self
        """Handles subtraction when a constant is on the left (e.g., 2 - Value(3))."""
        return Value(other) + (-self) # Convert other to Value first

    def __rtruediv__(self, other): # other / self
        """Handles division when a constant is on the left (e.g., 2 / Value(3))."""
        return Value(other) * (self**-1) # Convert other to Value first

    # --- Activation and other mathematical functions ---

    def exp(self):
        """Applies the exponential function (e^x)."""
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')

        def _backward():
            # Gradient rule: d(e^x)/dx = e^x
            # dL/dself = dL/dout * dout/dself = dL/dout * exp(self.data) = dL/dout * out.data
            self.grad += out.grad * out.data
        out._backward = _backward

        return out

    def log(self):
        """Applies the natural logarithm (ln(x))."""
        x = self.data
        if x <= 0:
            # Avoid math domain error and gradient issues with log(0) or log(negative)
            # In a real library, might add a small epsilon or handle differently.
            raise ValueError("Logarithm undefined for non-positive values.")
        out = Value(math.log(x), (self,), 'log')

        def _backward():
            # Gradient rule: d(ln(x))/dx = 1/x
            # dL/dself = dL/dout * dout/dself = dL/dout * (1 / self.data)
            self.grad += out.grad * (1.0 / self.data)
        out._backward = _backward

        return out

    def sin(self):
        """Applies the sine function (sin(x), assumes radians)."""
        x = self.data
        out = Value(math.sin(x), (self,), 'sin')

        def _backward():
            # Gradient rule: d(sin(x))/dx = cos(x)
            # dL/dself = dL/dout * dout/dself = dL/dout * cos(self.data)
            self.grad += out.grad * math.cos(self.data)
        out._backward = _backward

        return out

    def tanh(self):
        """Applies the hyperbolic tangent activation function."""
        x = self.data
        t = math.tanh(x)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # Gradient rule: d(tanh(x))/dx = 1 - tanh(x)^2
            # dL/dself = dL/dout * dout/dself = dL/dout * (1 - out.data**2)
            self.grad += out.grad * (1 - out.data**2)
        out._backward = _backward

        return out

    def relu(self):
        """Applies the Rectified Linear Unit (ReLU) activation function."""
        out = Value(max(0, self.data), (self,), 'ReLU')

        def _backward():
            # Gradient rule: d(ReLU(x))/dx = 1 if x > 0, 0 otherwise
            # dL/dself = dL/dout * dout/dself = dL/dout * (1 if self.data > 0 else 0)
            # Note: Using out.data > 0 is also common and equivalent here.
            self.grad += out.grad * (1.0 if self.data > 0 else 0.0)
        out._backward = _backward

        return out

    def sigmoid(self):
        """Applies the sigmoid activation function."""
        x = self.data
        # Stable sigmoid implementation:
        if x >= 0:
            z = math.exp(-x)
            s = 1 / (1 + z)
        else:
            z = math.exp(x)
            s = z / (1 + z)
        out = Value(s, (self,), 'sigmoid')

        def _backward():
            # Gradient rule: d(sigmoid(x))/dx = sigmoid(x) * (1 - sigmoid(x))
            # dL/dself = dL/dout * dout/dself = dL/dout * (out.data * (1 - out.data))
            self.grad += out.grad * (out.data * (1 - out.data))
        out._backward = _backward

        return out


    # --- Backpropagation ---

    def backward(self):
        """
        Performs the backward pass (reverse-mode automatic differentiation)
        starting from this Value object. It computes the gradients of all
        nodes in the computation graph that led to this node.

        Assumes this node is the final scalar output (e.g., loss L).
        """

        # Step 1: Build a topologically sorted list of nodes
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:
                    build_topo(parent)
                topo.append(v) # Add node *after* all of its inputs are processed

        build_topo(self)

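        # Step 2: Initialize the gradient of the final node (self) to 1.0
        #         (dL/dL = 1). Gradients of all other nodes are 0 only on the
        #         first call; reset .grad to 0.0 before calling backward() again,
        #         because each _backward accumulates with +=.
        self.grad = 1.0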

        # Step 3: Iterate through the nodes in reverse topological order
        #         and apply the chain rule using the _backward functions.
        for node in reversed(topo):
            node._backward() # This calls the specific backward function defined
                             # by the operation that created 'node'.

In [None]:
# --- Example Usage (Matches the markdown example) ---

print("--- Running Example from Markdown ---")

# Input values
x1 = Value(2.0, label='x1')
x2 = Value(3.0, label='x2')
x3 = Value(4.0, label='x3')

# Build the computation graph: L = log(x1*x2) + sin(x1*x3) + x2*x3
# Let's break it down step-by-step like in the markdown

# a = x1 * x2
a = x1 * x2; a.label = 'a'
# b = log(a)
b = a.log(); b.label = 'b'

# c = x1 * x3
c = x1 * x3; c.label = 'c'
# d = sin(c)
d = c.sin(); d.label = 'd'

# e = x2 * x3
e = x2 * x3; e.label = 'e'

# f = b + d
f = b + d; f.label = 'f'
# L = f + e
L = f + e; L.label = 'L'

# --- Forward Pass Verification ---
print(f"Forward Pass Result L.data: {L.data:.4f}")
# Expected: log(2*3) + sin(2*4) + 3*4 = log(6) + sin(8) + 12
# Expected: ~1.7918 + 0.9894 + 12 ≈ 14.7811 (from the unrounded terms)
print("---")

# --- Backward Pass ---
print("Running Backward Pass (L.backward())...")
L.backward()
print("---")

# --- Check Gradients ---
print("Gradients after backpropagation:")
print(f"{x1.label}: {x1}") # Expected: ~ -0.0820
print(f"{x2.label}: {x2}") # Expected: ~ 4.3333
print(f"{x3.label}: {x3}") # Expected: ~ 2.7090
print("---")

# --- You can also inspect intermediate gradients if needed ---
print("Intermediate node gradients:")
print(f"{a.label}: {a}") # dL/da = dL/db * db/da = b.grad * (1/a.data) = 1.0 * (1/6) = 0.1667
print(f"{b.label}: {b}") # dL/db = dL/df * df/db = f.grad * 1 = 1.0 * 1 = 1.0
print(f"{c.label}: {c}") # dL/dc = dL/dd * dd/dc = d.grad * cos(c.data) = 1.0 * cos(8) ~ -0.1455
print(f"{d.label}: {d}") # dL/dd = dL/df * df/dd = f.grad * 1 = 1.0 * 1 = 1.0
print(f"{e.label}: {e}") # dL/de = dL/dL * dL/de = L.grad * 1 = 1.0 * 1 = 1.0
print(f"{f.label}: {f}") # dL/df = dL/dL * dL/df = L.grad * 1 = 1.0 * 1 = 1.0
print(f"{L.label}: {L}") # dL/dL = 1.0 (by definition)

In [None]:
print("\n--- Testing other operations ---")
v1 = Value(2.0, label='v1')
v2 = Value(5.0, label='v2')

# Test division: y = v1 / v2 = 2.0 / 5.0 = 0.4
y_div = v1 / v2; y_div.label = 'y_div'
y_div.backward()
print(f"Division: {y_div}")
# dy/dv1 = 1/v2 = 1/5 = 0.2
# dy/dv2 = -v1 / v2^2 = -2 / 25 = -0.08
print(f" Gradient {v1.label}: {v1.grad:.4f} (Expected: 0.2)")
print(f" Gradient {v2.label}: {v2.grad:.4f} (Expected: -0.08)")
v1.grad = 0; v2.grad = 0 # Reset grads for next test

# Test exponentiation: y = v1**3 = 2.0**3 = 8.0
y_pow = v1**3; y_pow.label = 'y_pow'
y_pow.backward()
print(f"Power: {y_pow}")
# dy/dv1 = 3 * v1^2 = 3 * 2^2 = 12
print(f" Gradient {v1.label}: {v1.grad:.4f} (Expected: 12.0)")
v1.grad = 0 # Reset grad

# Test ReLU: y = relu(-3.0) = 0
v_neg = Value(-3.0, label='v_neg')
y_relu1 = v_neg.relu(); y_relu1.label = 'y_relu1'
y_relu1.backward()
print(f"ReLU (negative input): {y_relu1}")
print(f" Gradient {v_neg.label}: {v_neg.grad:.4f} (Expected: 0.0)")
v_neg.grad = 0

# Test ReLU: y = relu(4.0) = 4.0
v_pos = Value(4.0, label='v_pos')
y_relu2 = v_pos.relu(); y_relu2.label = 'y_relu2'
y_relu2.backward()
print(f"ReLU (positive input): {y_relu2}")
print(f" Gradient {v_pos.label}: {v_pos.grad:.4f} (Expected: 1.0)")
v_pos.grad = 0

# Test Sigmoid: y = sigmoid(0.0) = 0.5
v_zero = Value(0.0, label='v_zero')
y_sig = v_zero.sigmoid(); y_sig.label = 'y_sig'
y_sig.backward()
print(f"Sigmoid (zero input): {y_sig}")
# dy/dv_zero = sig(0)*(1-sig(0)) = 0.5*(1-0.5) = 0.25
print(f" Gradient {v_zero.label}: {v_zero.grad:.4f} (Expected: 0.25)")
v_zero.grad = 0

### Visualization with Graphviz
Following Karpathy's example, we can visualize the computation graph using Graphviz. The `graphviz` library allows us to create directed graphs in a simple and intuitive way.

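To install the Python package, try `pip install graphviz`. Rendering also requires the Graphviz executables (e.g., `dot`) on your `PATH`; with conda, `conda install graphviz` provides those executables, and `conda install python-graphviz` installs the Python package together with them.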

In [None]:
import graphviz # Make sure this is installed

def trace(root):
    """
    Builds a set of all nodes and edges in a computation graph starting
    from a root Value object.

    Args:
        root (Value): The final node of the graph to trace back from.

    Returns:
        tuple: A tuple containing two sets:
               - nodes (set): All Value objects in the graph.
               - edges (set): Tuples representing connections (parent_value, child_value).
    """
    nodes, edges = set(), set()
    visited = set()
    def build(v):
        if v not in visited:
            visited.add(v)
            nodes.add(v)
            for parent in v._prev:
                edges.add((parent, v)) # Edge points from parent to child
                build(parent)          # Recurse on parents
    build(root)
    return nodes, edges

def draw_dot(root, format='svg', rankdir='LR'):
    """
    Generates a Graphviz visualization of the computation graph ending at 'root'.

    Args:
        root (Value): The root node of the graph (e.g., the final loss L).
        format (str): The output format for Graphviz ('svg', 'png', 'pdf', etc.).
        rankdir (str): The direction of the graph layout ('LR' for left-to-right,
                       'TB' for top-to-bottom).

    Returns:
        graphviz.Digraph: The Graphviz object representing the graph.
                          You can render this in Jupyter or save it to a file.

    Example Usage (in Jupyter):
        # Assume L is the final Value object from your calculation
        dot_graph = draw_dot(L)
        dot_graph # This will display the graph in the notebook output cell
    """
    assert rankdir in ['LR', 'TB']
    nodes, edges = trace(root) # Get all unique nodes and parent->child connections
    dot = graphviz.Digraph(format=format, graph_attr={'rankdir': rankdir})

    for n in nodes:
        # Use object's id as a unique identifier for the node in Graphviz
        uid = str(id(n))
        # Create a node for the Value object itself
        # Use 'record' shape for multi-line labels (label | data | grad)
        node_label = f"{{ {n.label+' | ' if n.label else ''}data {n.data:.4f} | grad {n.grad:.4f} }}"
        dot.node(name=uid, label=node_label, shape='record')

        if n._op: # If this Value node was created by an operation
            # Create a small, distinct node representing the operation
            op_uid = uid + n._op # Unique ID for the operation node
            dot.node(name=op_uid, label=n._op, shape='ellipse') # Use ellipse or circle for ops
            # Add an edge from the operation node to the Value node it created
            dot.edge(op_uid, uid)
            # Add edges from the parent Value nodes to this operation node
            for parent in n._prev:
                parent_uid = str(id(parent))
                dot.edge(parent_uid, op_uid)

    return dot

In [None]:
# --- Example Usage with the previous calculation ---
# Make sure the Value class definition and the example calculation
# from the previous steps have been executed first.

# Assuming 'L' is the final node from: L = log(x1*x2) + sin(x1*x3) + x2*x3
print("\n--- Generating Computation Graph Visualization ---")
# Calculate L again if necessary
x1 = Value(2.0, label='x1')
x2 = Value(3.0, label='x2')
x3 = Value(4.0, label='x3')
a = x1 * x2; a.label = 'a'
b = a.log(); b.label = 'b'
c = x1 * x3; c.label = 'c'
d = c.sin(); d.label = 'd'
e = x2 * x3; e.label = 'e'
f = b + d; f.label = 'f'
L = f + e; L.label = 'L'
L.backward() # Run backward pass to populate gradients for display

# Generate the graph object
dot_graph = draw_dot(L)

# To save to a file (e.g., SVG):
# try:
#     path = dot_graph.render('computation_graph', view=False) # Returns the path of the rendered file
#     print(f"Graph saved to {path}")
# except Exception as err: # Named 'err' so the Value called 'e' above is not clobbered
#     print(f"Could not save graph. Ensure Graphviz executable is installed and in PATH. Error: {err}")

dot_graph # Display the graph in the notebook output cell