# Team04 Final Project: AutoDiff Package
 
**Team member**: Yixian Gan, Siyao Li, Ting Cheng, Li Yao, Haitian Liu

## Introduction
**Automatic Differentiation (AutoDiff)** is a Python package that implements forward-mode automatic differentiation for user-defined functions.

In scientific research or engineering projects, we often need to compute the derivative of a function (for example, the $f'(x)$ term in Newton's method). For simple functions, we can derive an exact analytical expression with ease. However, once the function becomes complicated, deriving an analytical expression by hand may be tedious or even impractical. This problem becomes especially pressing in deep learning, where we are interested in the derivative of model losses with respect to input features, both of which could be vectors with hundreds of dimensions.

An alternative is to compute the derivative with automatic differentiation (AD). AD breaks a large, complex function down into a composition of elementary operations whose derivatives are trivial to compute. By tracking the derivatives of intermediate results and repeatedly applying the chain rule, AutoDiff can compute the derivative of an input function along a chosen direction. This matters because many machine learning methods rely on gradient descent, and gradient descent requires the gradient.



## Background

*This section provides a brief overview of the mechanism of AD. Users not interested in the math may skip to the* **How to use *AutoDiff*** *section below.*

- **Elementary Operation**

  The key concept of AD is to break down a complicated function into small, manageable steps and to solve each step individually. Typically, each step in AD performs only one elementary operation. Here, "elementary operations" refers to both arithmetic operations (`+`, `-`, `*`, scalar division, power, etc.) and elementary functions (`exp`, `log`, `sin`, `cos`, etc.). Each elementary operation takes only one or two inputs, and its partial derivative with respect to each input is easy to compute. We later chain these intermediate derivatives to get the overall result.
  
- **Chain Rule**

  The chain rule in calculus is the rule for computing the derivative of composite functions. It allows us to write the derivative of a composite function as the product of the derivatives of simpler functions. The simplest case is the derivative of a scalar function of a single scalar variable.
  
  $$\frac{d}{dx}f(u(x))=\frac{df(u)}{du}\frac{du(x)}{dx}$$
  
  A more general case is a function $f$ of an n-dimensional vector variable $\textbf{x}=(x_1, x_2, ...x_n)$. Then, instead of the derivative, we would like to compute the gradient of $f$ with respect to $\textbf{x}$. Suppose $f$ is a function of a vector $\textbf{y}$, which is itself a function of the vector $\textbf{x}$. The chain rule for multivariate functions is given by
  
  $$\nabla_xf=\frac{\partial f}{\partial y_1}\nabla_xy_1+\frac{\partial f}{\partial y_2}\nabla_xy_2+... $$
  $$=\sum_i \frac{\partial f}{\partial y_i}\nabla_xy_i$$
  
  The chain rule is exceptionally useful in AD: we can think of the components of $\textbf{y}$ as the intermediate results at each step, so that, by the chain rule, the gradient of the function of interest is assembled from the derivatives calculated in each small step.

- **Directional Derivative** $D_p$
 
    An intuitive way to think of the gradient is as the direction in n-dimensional space in which the function $f(x_1, x_2, ...,x_n)$ increases the fastest. For a function of an n-dimensional variable $\textbf{x}$, the gradient is also an n-dimensional vector. Therefore, storing the gradient of every intermediate result in AD can be computationally costly (there might be millions of intermediate results in some complicated computations!). A remedy is to store directional derivatives instead. Rather than the direction of steepest ascent, we calculate the rate of increase along a chosen direction of interest. Mathematically, the directional derivative of $f(\textbf{x})$ in direction $\textbf{p}$ is defined as the *projection* of the gradient of $f$ onto $\textbf{p}$.
 
    $$D_{\textbf{p}}f=\nabla_xf\cdot \textbf{p}$$
 
    Therefore, instead of the gradient of each intermediate result, we would store only the directional derivative of each intermediate result. These directional derivatives are dot products of vectors, so they are all scalars themselves, which are much more efficient to store.
 
- **Computational Graph**
 
    A computational graph is just a directed graph that describes how to break down the complicated function into elementary operations, and what are the intermediate values to be computed. The vertices in the computational graph are intermediate values, and the edges are elementary operations. An edge from $v_1$ to $v_2$ means to perform a certain elementary operation on intermediate value $v_1$ to get the next intermediate value $v_2$.
 
- **Trace**
 
    Traces simply mean the values we would like to keep track of in the forward pass in AD. For forward-mode AD, which is the backbone of this project, there are two traces, *Primal Trace* and *Tangent Trace*.
 
    **Primal trace** stores the elementary operation to get one intermediate value from previous results.
 
    For example, for $f(x)=e^{-\sin(x)}$, the primal trace is
 
    $$v_0=x$$
    $$v_1=\sin(v_0)$$
    $$v_2=-v_1$$
    $$v_3=\exp(v_2)$$
 
    Primal trace provides the recipe for each intermediate value and eventually leads us to the final answer.
 
    **Tangent trace** stores the *directional derivative* of each intermediate value. By the chain rule, when $v_j$ is computed from a single earlier value $v_i$, its tangent is the product $\frac{dv_j}{dv_i}D_pv_i$; when $v_j$ depends on several earlier values, the corresponding products are summed.
 
    Using the same example as before, the tangent trace of $f(x)$ is
 
    $$D_pv_0=1$$
    $$D_pv_1=\frac{dv_1}{dv_0}D_pv_0=\frac{d\sin(v_0)}{dv_0}D_pv_0=\cos(v_0)D_pv_0$$
    $$D_pv_2=\frac{d}{dv_1}(-v_1)D_pv_1=-D_pv_1$$
    $$D_pv_3=\frac{d}{dv_2}\exp(v_2)D_pv_2=\exp(v_2)D_pv_2$$

- **Dual Number**

    Dual numbers are expressions of the form $x = a + b\varepsilon$, where $a, b \in \mathbb{R}$ and $\varepsilon$ is chosen such that $\varepsilon^2 = 0$ while $\varepsilon \neq 0$. Dual numbers have properties that make them useful for calculating derivatives.<br>
    Given any real polynomial $P(x) = p_0 + p_1x + p_2x^2 + \dots + p_nx^n$, let $x = a + b\varepsilon$, 
    $$P(a + b\varepsilon)= p_0 + p_1(a + b\varepsilon) + \cdots + p_n(a + b\varepsilon)^n$$
    Since $\varepsilon^2 = 0$, every term containing $\varepsilon^i$ with $i \geq 2$ vanishes, so only the constant and linear terms in $\varepsilon$ survive:
    $$P(a + b\varepsilon)= p_0 + p_1a + p_2a^2 + \cdots + p_na^n + p_1 b\varepsilon + 2 p_2 a b\varepsilon + \cdots + n p_n a^{n-1} b\varepsilon$$
    $$P(a + b\varepsilon)= P(a) + bP'(a)\varepsilon$$
    We can use Taylor series of $f(x)$ expanding around $c = a + 0\varepsilon$ to generalize the idea, 
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(c)}{n !}(x-c)^n = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n !}(b\varepsilon)^n = f(a) + b\varepsilon f'(a)$$
    The first term corresponds to the primal trace and the second to the tangent trace, both discussed above and central to our AD process.
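
The primal and tangent traces of the example $f(x)=e^{-\sin(x)}$ from the **Trace** item can be reproduced by hand in plain Python. The sketch below does not use the AutoDiff package; the function name `forward_trace` and its `seed` parameter are illustrative assumptions. Each pair `(v_k, d_k)` plays the role of the real and dual parts of a dual number.

```python
import math

def forward_trace(x, seed=1.0):
    """Return the primal and tangent traces of f(x) = exp(-sin(x))."""
    # Primal trace: one elementary operation per step.
    v0 = x
    v1 = math.sin(v0)
    v2 = -v1
    v3 = math.exp(v2)
    # Tangent trace: the chain rule applied to each step in turn.
    d0 = seed
    d1 = math.cos(v0) * d0
    d2 = -d1
    d3 = math.exp(v2) * d2
    return [v0, v1, v2, v3], [d0, d1, d2, d3]

primal, tangent = forward_trace(0.5)
f_val, f_deriv = primal[-1], tangent[-1]

# Closed form for comparison: f'(x) = -cos(x) * exp(-sin(x))
expected = -math.cos(0.5) * math.exp(-math.sin(0.5))
print(f_val, f_deriv, expected)
```

The last entry of the tangent trace should agree with the closed-form derivative up to floating-point rounding.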



## How to use *AutoDiff*

- Install Package from PyPI
    - MacOS / Linux
    
        ```bash
        $ python -m pip install AutoDiff
        ```

    - Windows

        ```bash
        $ py -m pip install AutoDiff
        ```

- Import module

    ```python
    import AutoDiff as AD
    ```

- `DualNumber` class

    * Initialization

        Let's assume we have values `real` and `dual`, to create a dual object:

        ```python
        from AutoDiff.util import DualNumber
        dual_number = DualNumber(real, dual)  
        ```
    
    * Operators
    
        1. Arithmetic Operators
        
            The following arithmetic operators are supported with DualNumber:
            - `+`
            - `-`
            - `*`
            - `/` 
            - `exp()`
            
            Example:

            ```python
            n_one = DualNumber(real_one, dual_one)
            n_two = DualNumber(real_two, dual_two)

            add_result = n_one + n_two
            minus_result = n_one - n_two
            multiply_result = n_one * n_two
            division_result = n_one / n_two
            exp_result = n_one.exp(2)
            ```
    
        2. Trigonometry Operators

            The following trigonometry operators are supported with DualNumber:
            - `sin()`
            - `cos()`
            - `tan()`
            
            Example:

            ```python
            import AutoDiff as ad
            from AutoDiff.util import DualNumber

            n_one = DualNumber(real_one, dual_one)

            sin_result = ad.sin(n_one)
            cos_result = ad.cos(n_one)
            tan_result = ad.tan(n_one)
            ```
- `Expression` class

    * Initialization

        Note that the `Expression` class serves as a base class for the `Variable` and `Function` classes. Initializing an `Expression` object directly is not supported, and users should not create instances of the `Expression` class.
    
    * Operators

        The classes inherited from the Expression class will be able to use the following operators.

        1. Arithmetic Operators
            - `+`
            - `-`
            - `*`
            - `/` 
            - `exp()`
    
        2. Trigonometry Operators
            - `sin()`
            - `cos()`
            - `tan()`
        
        3. Differentiation Operators
            - `deriv()`


- `Variable` class

    * Initialization

        During the initialization of a Variable object, only the `name` is specified. Users don't have to know the value of a variable ahead of time. This allows us to create a general form of expression.

        ```python
        from AutoDiff.expression import Variable

        x = Variable('x')
        ```
    
    * Set value

        To set the value of a Variable object, simply call the Variable object with the desired input. The input should be an iterable that contains two numbers representing the real and dual parts of the variable.
        
        A DualNumber object is constructed and returned to the user as a representation of the input.

        Example:

        ```python
        x = Variable('x')
        dual_number = x([real, dual])
        ```

- `Function` class

    * Initialization
        
        1. `__init__(self, exp1, exp2, op)`

            The constructor takes three arguments: `exp1`, `exp2`, and `op`. `exp1` and `exp2` are `Expression` objects, each of which can be either a `Variable` or a `Function`. `op` is one of the supported operators.

            Example:

            ```python
            from AutoDiff.expression import Variable, Function

            x, y = Variable('x'), Variable('y')
            # Build x * y, then (x * y) - x, from Expression objects
            func1 = Function(x, y, Function.multiply)
            func2 = Function(func1, x, Function.minus)
            ```

        2. We can also create a `Function` object with the operators listed in the Operators section of the `Expression` class.

            Example:

            ```python
            import AutoDiff as ad
            from AutoDiff.expression import Variable, Function

            x, y = Variable('x'), Variable('y')
            function = x + y
            ```

            Here's a more complicated example:

            ```python
            x, y = Variable('x'), Variable('y')
            function = ad.sin(x*4) + ad.cos(y*4)
            function = function + ad.exp(x*y)
            # Differentiate at (x, y) = (1, 2) along the x direction
            function.deriv({x: 1, y: 2}, seed={x: 1, y: 0})
            ```
    
    * Evaluation

        To evaluate a function, simply call the function with an input dictionary, and the result will be returned.
    

        Example:

        ```python
        func1 = Function(x, y, Function.multiply)
        result = func1({x: 1, y: 2}, seed={x: 1, y: 0})
        ```
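
To sanity-check a derivative such as the one requested by `deriv()` in the more complicated example above, it can help to compare against a hand-derived formula and a finite-difference estimate. The sketch below uses only the standard library and does not call AutoDiff; it assumes that the seed `{x: 1, y: 0}` selects the partial derivative with respect to `x`, and the helper names `f` and `df_dx` are illustrative.

```python
import math

def f(x, y):
    # Same function as in the example: sin(4x) + cos(4y) + exp(xy)
    return math.sin(4 * x) + math.cos(4 * y) + math.exp(x * y)

def df_dx(x, y):
    # Hand-derived partial derivative with respect to x
    return 4 * math.cos(4 * x) + y * math.exp(x * y)

x0, y0, h = 1.0, 2.0, 1e-6
# Central finite difference along the x direction
central_diff = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)

print(df_dx(x0, y0), central_diff)
```

The two printed numbers should agree to several significant digits; a forward-mode AD result for the same point and seed would be expected to match the analytical value up to rounding.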

## Software Organization

***Directory Structure:***
    
```  
    team04/
    ├── docs/
    │   ├── milestone1.ipynb
    │   ├── dual.md
    │   └── expression.md
    ├── src/
    │   ├── __init__.py
    │   ├── dual.py
    │   └── expression.py
    ├── tests/
    │   ├── util/
    │   └── expression/
    ├── LICENSE
    ├── README.md
    ├── pyproject.toml
    └── .gitignore
```

- ***team04/***

    This is the project's **root folder**, which contains the README file, the license, .gitignore, and the sub-directories for source code, tests, and documentation.

- ***team04/docs/***  

    The **team04/docs/** directory contains the **documentation** that explains the usage of the classes and functions defined in this project. In addition, our milestone progress is also stored in this folder.

- ***team04/src/***  

    The **team04/src/** directory contains all the **source code**. Our current plan is to include two modules, util and expression. The util module provides a DualNumber class, which carries out the actual computation in expression evaluation. The expression module provides support for function declaration and evaluation. Please see the **Implementation** section below for more details.

- ***team04/tests/***  

    The **team04/tests/** directory contains the **unit and integration tests** of this project, which check that the project functions properly before release.


***Distribution***:

We plan to distribute AutoDiff on PyPI, with a build configuration that follows PEP 518. Users will be able to install our package from the command line with

```bash
    $ python -m pip install AutoDiff
```

Users will also be able to clone or download a copy of the source code from our GitHub repository. To contribute to the project, users can fork the repository and open a pull request.

***Note***:

We are still deciding whether to support two modes of operation, forward mode and reverse mode. This is not yet reflected in the organization above.

## Implementation

Currently, we plan to implement two core modules, `util` and `expression`. The Python libraries we plan to use in our project are NumPy and pytest.

### `util`

The `util` module currently includes a single class called `DualNumber`, which will be used to compute the primal and tangent traces in the computational graph. We plan to overload its `__add__` and `__mul__` dunder methods, and further implement methods like `_sin(), _cos(), _exp(), _log()` to support these elementary functions on dual numbers. `DualNumber` is the foundation of our AD package and is planned to be implemented and comprehensively tested first.
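
A minimal sketch of what such a class could look like is shown below. It is an illustration under assumptions rather than the package's actual implementation: the attribute names `real` and `dual`, the `_wrap` helper, and the reflected operators are hypothetical choices, and only the operations named above are covered.

```python
import math

class DualNumber:
    """Hypothetical sketch of a dual number a + b*eps with eps**2 == 0."""

    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    @staticmethod
    def _wrap(other):
        # Promote plain numbers to dual numbers with a zero dual part.
        return other if isinstance(other, DualNumber) else DualNumber(other)

    def __add__(self, other):
        other = self._wrap(other)
        return DualNumber(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._wrap(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return DualNumber(self.real * other.real,
                          self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__

    def _sin(self):
        return DualNumber(math.sin(self.real), math.cos(self.real) * self.dual)

    def _cos(self):
        return DualNumber(math.cos(self.real), -math.sin(self.real) * self.dual)

    def _exp(self):
        e = math.exp(self.real)
        return DualNumber(e, e * self.dual)

    def _log(self):
        return DualNumber(math.log(self.real), self.dual / self.real)


# g(x) = sin(x*x + 3x) at x = 2, seeded with dx = 1
x = DualNumber(2.0, 1.0)
y = (x * x + 3 * x)._sin()
print(y.real, y.dual)
```

Here `y.real` holds $\sin(10)$ and `y.dual` holds the derivative $(2x + 3)\cos(x^2 + 3x)$ evaluated at $x = 2$.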

### `expression`

Another module, which is also the core of our AD package, is `expression`. Essentially, all functions would be treated as some type of expression. We imagine including an abstract base class `Expression` in this module and two child classes, `Variable` and `Function`, that inherit from it.

#### `Expression`

  This should be an abstract base class, and is not intended to be instantiated directly. Its child classes, `Variable` and `Function`, carry out the real work.
  
#### `Variable`

  Users are expected to declare the variables in their function via this class. An instance of `Variable` has two attributes, `name` and `value`. `name` is a string that identifies the variable. Typically this would just be 'x' or 'y', but other names are also allowed. The name of a variable is set at initialization and is expected to be unique. The `value` attribute stores the value of the variable as a `DualNumber`. It is not required at initialization; the program assigns it when the expression is evaluated. We would also overload the dunder `__add__` and `__mul__` methods of the `Variable` class so that they return an instance of `Function`.

  Inputs can be either a list of variables or a list of functions. A function of a list of m variables is a map from m-dimensional space to 1-dimensional space, while a list of m functions, each a function of n variables, is a map from n-dimensional space to m-dimensional space. The vector calculus version of AD can be implemented in later milestones.
  
  - **Evaluation**
  
    We decided to separate function declaration from evaluation so that users do not need to decide, at declaration time, at which point to calculate the gradient or which seed vector to use. We plan to overload the dunder `__call__` method of the `Variable` class to make it callable. The overloaded `__call__` method would accept two `dict` inputs: the first specifies the point at which to evaluate the function, and the second specifies the seed vector. An ideal usage would look like this:
    
    ```python
    x = Variable('x')
    x(inputs={'x': 1}, seed={'x': 1})
        
    ```
    *Note: We may also allow `inputs` and `seed` to be scalars to make the syntax more succinct in the case of simple functions. This could be one of the future extensions.*
    
    When called, the `__call__` method looks up its value in `inputs` by name. It also looks up its name in the `seed` dictionary. The value of the variable is then stored as an instance of `DualNumber` whose real part is the value of the variable and whose dual part is its directional derivative. This dual number is then returned by `__call__`.
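
The lookup described above could be sketched as follows. This is a hypothetical illustration, not the package's code: the `Dual` named tuple stands in for `DualNumber`, and treating a variable that is absent from `seed` as having a zero seed component is an assumption made here.

```python
from collections import namedtuple

# Stand-in for the package's DualNumber; only the two fields matter here.
Dual = namedtuple("Dual", ["real", "dual"])

class Variable:
    """Hypothetical sketch of a named, callable variable."""

    def __init__(self, name):
        self.name = name
        self.value = None  # assigned at evaluation time

    def __call__(self, inputs, seed=None):
        if self.name not in inputs:
            raise KeyError(f"no value given for variable {self.name!r}")
        # Assumption: a variable missing from the seed gets component 0.
        direction = 0.0 if seed is None else seed.get(self.name, 0.0)
        self.value = Dual(inputs[self.name], direction)
        return self.value

x = Variable("x")
result = x(inputs={"x": 1.0}, seed={"x": 1.0})
print(result)
```

Keying both dictionaries by name means the same `inputs` and `seed` can be passed unchanged to every variable in a larger expression.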

   

#### `Function`

  A `Function` performs some elementary operation on other `Expression`(s). At initialization, a `Function` object takes three parameters, `exp1`, `exp2`, and `op`, where `exp2` is optional. `exp1` and `exp2` can be either `Function` or `Variable` objects, and they define what the current function is applied to. `op` should be a function object that we provide as a static function within the `Function` class. These are just elementary operations, including but not limited to `add`, `mul`, `exp`, `sin`, etc. It is not necessary (nor expected) for users to know these static functions. The overloaded dunder methods and several pre-defined elementary function classes that inherit from `Function` should take care of these details.
  

  - **Evaluation**
  
    Similar to `Variable` class, `Function` class is also callable. On calling, the `__call__` function would recursively call the precursor expressions of current function, and then combine the result by calling the `op` static function defined before. 
    
    
Note that our implementation doesn't need a special graph node class to store the computational graph. By keeping track of precursor expressions within each `Function` object, our program constructs the computational graph automatically. Users should feel free to construct their own computational graphs for visualization if needed. The algorithm we use for AD is recursive, and we expect that it can be extended to reverse-mode AD (backpropagation) in future development.
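
The recursive evaluation can be sketched in a few lines. The code below is a simplified illustration under assumptions, not the package's implementation: values are plain `(real, dual)` tuples instead of `DualNumber` objects, only three operations are defined, and the constructor defaults are hypothetical.

```python
import math

class Variable:
    def __init__(self, name):
        self.name = name

    def __call__(self, inputs, seed):
        # Leaf of the graph: return (value, directional derivative).
        return inputs[self.name], seed.get(self.name, 0.0)


class Function:
    def __init__(self, exp1, exp2=None, op=None):
        self.exp1, self.exp2, self.op = exp1, exp2, op

    def __call__(self, inputs, seed):
        # Recursively evaluate the precursor expressions, then combine them.
        args = [self.exp1(inputs, seed)]
        if self.exp2 is not None:
            args.append(self.exp2(inputs, seed))
        return self.op(*args)

    # Elementary operations on (real, dual) pairs.
    @staticmethod
    def add(a, b):
        return a[0] + b[0], a[1] + b[1]

    @staticmethod
    def mul(a, b):
        return a[0] * b[0], a[0] * b[1] + a[1] * b[0]

    @staticmethod
    def sin(a):
        return math.sin(a[0]), math.cos(a[0]) * a[1]


x, y = Variable("x"), Variable("y")
# f(x, y) = sin(x * y) + x, built from nested Function objects
xy = Function(x, y, Function.mul)
f = Function(Function(xy, op=Function.sin), x, Function.add)

point = {"x": 1.0, "y": 2.0}
value, df_dx = f(point, {"x": 1.0, "y": 0.0})  # seed along x
_, df_dy = f(point, {"x": 0.0, "y": 1.0})      # seed along y
print(value, df_dx, df_dy)
```

Each forward pass yields the derivative along one seed direction, so recovering the full gradient of a function of n variables this way takes n passes, one per unit seed vector.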

Ideally, the usage of our package should be fairly simple and intuitive. An example could look like this:

  ```python
  from AutoDiff import expression as e
  
  x1, x2, x3, x4 = e.Variable(['x1','x2','x3','x4'])
  f = (x1 ** 2 + 4 * x2) - 2 * e.Sin(x3*x4) * e.Exp(x2 / 2)
  inputs = {'x1': 1,
            'x2': 2,
            'x3': 3,
            'x4': 4}
  seed = {'x1': 1,
          'x2': 0,
          'x3': 0,
          'x4': 0}
  f_val, f_grad = f(inputs, seed)
  ```
  
  
  

## Licensing

We chose the **MIT License** because we would like to permit unrestricted use and distribution of our program, so that the whole community can benefit from it without legal obstacles. It is a permissive license that is generally compatible with other open-source licenses as well as with closed-source, proprietary products. The MIT License is also short and easy to understand, and it fits our needs well.

## Feedback 


- **Introduction** 
  - Good job!
- **Background**
  - May have two more subsections providing background introductions to the Jacobian matrix and seed vectors. -**Fix**: Added introduction to the Jacobian matrix and seed vectors. 
- **How-to-use**
  - Some subsections include too many details that may be hidden from users. -**Fix**: Removed redundant implementation details from users.
  - Imagine you are the user, what is the information you want to know about this package? Just providing a table of available APIs, the inputs, the expected outputs is enough. -**Fix**: Provided a table of APIs, inputs and outputs
  - Evaluate a function, do we really need to specify what the elementary function is? -**Fix**: Updated the evaluation of function. 
  - subtract but not minus -**Fix**: Fixed typo. 
  - Function class: -**Fix**: Refined the option to create a function for users. 
    - Better just provide users the preferred option to create a function.
    - We can also create function class, -> create a Function object. Be careful when using these terms.
  - If I were a user without background knowledge of seed vector, can the package provide an easier interface for me to use without specifying seed? -**Fix**: Provided a default result if the user didn't specify seed vectors.  
  - Please fix the indentation in the example codes :) -**Fix**: Fixed the indentation in code blocks.
- **Software organization**
  - Good job!
- **Implementation** TODO 
  - The only library is NumPy? What about pytest? -**Fix**: Included pytest in the libraries. 
  - util
    - Are you going to implement _sin(), etc, in DualNumber class? Is that viable? -**Fix**: As far as we can tell, it is feasible to implement elementary functions in the DualNumber class. Using the Taylor expansion of the function, we can compute it in the form $f(z)=f(a) + f'(a)b\epsilon$. For example, $\sin(z)=\sin(a+b\epsilon)=\sin(a)+\cos(a)b\epsilon$. We should be able to compute this value and return a new DualNumber object initialized to it.
  - expression
    - Variable The wrapper class solution for handling vector inputs doesn't make much sense to me. Since you didn't mention what is the type of Expression.value, why can't the type be numpy.ndarray/list? 
    - The uniqueness of variable name attribute is a very interesting point. How are you going to implement the uniqueness checking? -**Fix**: Our package would not enforce the uniqueness of variable names, since it should be possible to have two variables of the same name in two different functions. This point in the docs should serve as a reminder to users that if they assign the same name to two variables, the former one would be shadowed, which could result in unexpected behavior.
  - Function
    - Evaluation
      - Very good idea to separate the definition and initialization of a Variable. However why should a variable take a dictionary as call()'s parameter given that the Variable has already been defined with a name? -**Fix**: The docs were written this way to maintain consistency with calls to functions. But yes, we agree that this looks redundant. The docs are now updated to show the simpler way to evaluate a variable.
      - Why should the input and seed be a scalar? It really cannot support a vector Variable? I do agree that this can be extended in the future. -**Fix**: We think our current design should be able to support vector (or even matrix) inputs. We can check the shape of the variable in the `__call__` function of the Variable class.
      - On calling, the `__call__` method would find the appropriate value from input, according to its name. Do you mean a Variable called 'x' may have the probability to also represent a variable called 'y'? -**Fix**: This is not necessary when evaluating a variable. But we do want to keep the flexibility to specify the overall function in multiple steps, and probably adding more variables in later steps. For example, we hope that our users can do something like
    ```python
       x, y = Variable('x'), Variable('y')
       f1 = sin(x) * 5 + cos(y)
       z = Variable('z')  # a variable introduced in a later step
       f2 = exp(z + f1)
    ```
       - Then a function call like f2([1,2,3]) can be ambiguous, as the user is not necessarily specifying the values of the variables in alphabetical order. It is also cumbersome for users to keep track of the order in which all variables are declared. So we think that a dictionary should be a reasonable choice.
       - Note that our implementation doesn't need a special graph node class to store the computational graph. This is correct if you are going to only implement the forward mode. What do you think about the reverse mode? -**Fix**: For reverse mode, a possible remedy could be to add an additional attribute to keep track of the successors of the current Function or Variable.
- In all, this part is very excellent and detailed. Really appreciate your work!
- **Licensing**
  - Good choice!
  
