# Team04 Final Project: AutoDiff Package
 
**Team members**: Yixian Gan, Siyao Li, Ting Cheng, Li Yao, Haitian Liu

## Introduction
The **Automatic Differentiation (auto_diff)** package is a Python package that implements forward-mode automatic differentiation for user-defined input functions.

In scientific research or engineering projects, we often need to compute the derivative of a function (for example, the $f'(x)$ term in Newton's method). For simple input functions, we can derive an exact analytical expression with ease. However, once the function becomes complicated, it may be hard or even impossible to derive an analytical expression. This problem becomes especially intractable in deep learning, where we are interested in the derivative of model losses with respect to input features, both of which could be vectors with hundreds of dimensions.

An alternative is to compute the derivative with automatic differentiation (AD). It breaks a large, complex input function down into a composition of elementary functions, whose derivatives are trivial to compute. By tracking the derivatives of intermediate results and repeatedly applying the chain rule, AutoDiff can compute the derivative of an input function along a chosen direction. This matters because many machine learning methods rely on gradient descent, and gradient descent cannot run without the gradient.



## Background

*This section provides a brief overview of the mechanism of AD. Users not interested in the math may skip to the* **How to use *AutoDiff*** *section below.*

- **Elementary Operation**

  The key concept of AD is to break down a complicated function into small, manageable steps, and solve each step individually. Typically, each step in AD performs only one elementary operation. Here, "elementary operations" refers to both arithmetic operations (`+`, `-`, `*`, scalar division, power, etc.) and elementary functions (`exp`, `log`, `sin`, `cos`, etc.). Each elementary operation takes only one or two inputs, and its partial derivatives with respect to those inputs are easy to compute. We later chain these intermediate derivatives to get the overall result.
  
- **Chain Rule**

  The chain rule in calculus is the rule for computing the derivative of composite functions. It allows us to write the derivative of a composite function as the product of the derivatives of simpler functions. The simplest case is the derivative of a scalar function of a single scalar variable.
  
  $$\frac{d}{dx}f(u(x))=\frac{df(u)}{du}\frac{du(x)}{dx}$$
  
  A more general case is a function $f$ of an n-dimensional vector variable $\textbf{x}=(x_1, x_2, ..., x_n)$. Note that this is still a scalar function whose output is a real number. Then, instead of the derivative, we would like to compute the gradient of $f$ with respect to $\textbf{x}$. Suppose $f$ is a function of vector $\textbf{y}$, which itself is a function of vector $\textbf{x}$. The chain rule for multivariate functions is given by
  
  $$\nabla_xf=\frac{\partial f}{\partial y_1}\nabla_xy_1+\frac{\partial f}{\partial y_2}\nabla_xy_2+... $$
  $$=\sum_i \frac{\partial f}{\partial y_i}\nabla_xy_i$$
  
  If we let the output be a real-valued vector instead of a scalar, we have the most general case. For a function $f \colon \mathbb{R}^n \to \mathbb{R}^m$, the derivative is the Jacobian matrix $\mathbb{J}$, an $m \times n$ matrix that contains all the first-order partial derivatives. More specifically, we can write the Jacobian matrix as

  $$
  \mathbb{J}=\left[\begin{array}{ccc}
  \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_{1}} & \cdots & \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_{n}}
  \end{array}\right]=\left[\begin{array}{c}
  \nabla^{T} f_{1}(\mathbf{x}) \\
  \vdots \\
  \nabla^{T} f_{m}(\mathbf{x})
  \end{array}\right]=\left[\begin{array}{ccc}
  \dfrac{\partial f_{1}(\mathbf{x})}{\partial x_{1}} & \cdots & \dfrac{\partial f_{1}(\mathbf{x})}{\partial x_{n}} \\
  \vdots & \ddots & \vdots \\
  \dfrac{\partial f_{m}(\mathbf{x})}{\partial x_{1}} & \cdots & \dfrac{\partial f_{m}(\mathbf{x})}{\partial x_{n}}
  \end{array}\right]
  $$

  The Jacobian of a vector-valued function in several variables generalizes the gradient of a scalar-valued function in several variables. We can then apply the chain rule using matrix operations similar to what we described before. 
    
  The chain rule is exceptionally useful in AD: we can think of the $\textbf{y}$'s as the intermediate results at each step, so by the chain rule, the gradient of the function of interest is just the product of the gradients calculated in each small step.
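
As a quick numerical sanity check of the multivariate chain rule above, the sketch below uses a made-up pair of functions, $y(x)=(x_1x_2, \sin x_1)$ and $f(y)=y_1^2+3y_2$, and compares the chain-rule gradient with a central finite difference. The function names are illustrative only and are not part of the package.

```python
import math

def y(x1, x2):
    # Intermediate vector y = (y1, y2) = (x1 * x2, sin(x1))
    return (x1 * x2, math.sin(x1))

def f(y1, y2):
    # Scalar output f(y) = y1**2 + 3 * y2
    return y1 ** 2 + 3.0 * y2

def grad_f_chain_rule(x1, x2):
    y1, _ = y(x1, x2)
    df_dy = (2.0 * y1, 3.0)          # partial f / partial y_i
    grad_y1 = (x2, x1)               # gradient of y1 with respect to x
    grad_y2 = (math.cos(x1), 0.0)    # gradient of y2 with respect to x
    # grad_x f = sum_i (partial f / partial y_i) * grad_x y_i
    return tuple(df_dy[0] * g1 + df_dy[1] * g2
                 for g1, g2 in zip(grad_y1, grad_y2))

def grad_f_finite_diff(x1, x2, h=1e-6):
    # Central differences on the composed function, used only as a reference
    g = lambda a, b: f(*y(a, b))
    return ((g(x1 + h, x2) - g(x1 - h, x2)) / (2 * h),
            (g(x1, x2 + h) - g(x1, x2 - h)) / (2 * h))

analytic = grad_f_chain_rule(1.0, 2.0)
numeric = grad_f_finite_diff(1.0, 2.0)
print(analytic, numeric)
```

The two results should agree to several decimal places; the small gap is the truncation and round-off error of the finite-difference reference.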


- **Directional Derivative** $D_p$
 
    An intuitive way to think of the gradient is as the direction in n-dimensional space in which the function $f(x_1, x_2, ...,x_n)$ increases the fastest. For a function of an n-dimensional variable $\textbf{x}$, its gradient is also an n-dimensional vector. Therefore, storing the gradient of every intermediate result in AD can be computationally costly (there might be millions of intermediate results in some complicated computations!). A remedy is to store directional derivatives instead. The intuition behind the directional derivative is that instead of the direction of steepest ascent, we calculate the rate of increase along a certain direction of interest. Mathematically, the directional derivative of $f(\textbf{x})$ in direction $\textbf{p}$ is defined as the *projection* of the gradient of $f$ onto direction $\textbf{p}$.
 
    $$D_{\textbf{p}}f=\nabla_xf\cdot \textbf{p}$$
 
    Therefore, instead of the gradient of each intermediate result, we would store only the directional derivative of each intermediate result. These directional derivatives are dot products of vectors, so they are all scalars themselves, which are much more efficient to store.
    
    Formally, the vector $\textbf{p}$ is called the seed vector. It is a parameter supplied by the user that specifies the direction onto which the gradient is projected. It is preferable for $\textbf{p}$ to have unit length.
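
To make the role of the seed vector concrete, here is a minimal sketch that evaluates $D_{\textbf{p}}f=\nabla_xf\cdot \textbf{p}$ for an illustrative function $f(x_1, x_2)=x_1^2x_2+\sin(x_2)$ whose gradient is written out by hand. The function and helper names are assumptions made for this example, not part of the package.

```python
import math

def grad_f(x1, x2):
    # Hand-derived gradient of the illustrative f(x1, x2) = x1**2 * x2 + sin(x2)
    return (2.0 * x1 * x2, x1 ** 2 + math.cos(x2))

def directional_derivative(grad, p):
    # D_p f = grad f . p, a single scalar
    return sum(g_i * p_i for g_i, p_i in zip(grad, p))

g = grad_f(1.0, 2.0)
d_x1 = directional_derivative(g, (1.0, 0.0))   # seed e_1 picks out df/dx1
d_x2 = directional_derivative(g, (0.0, 1.0))   # seed e_2 picks out df/dx2
norm = math.sqrt(2.0)
d_diag = directional_derivative(g, (1.0 / norm, 1.0 / norm))  # unit-length diagonal seed
print(d_x1, d_x2, d_diag)
```

Seeding with the unit vectors recovers the individual partial derivatives, so the full gradient of an n-dimensional input costs n such evaluations, while a single direction of interest costs only one.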
 
- **Computational Graph**
 
    A computational graph is just a directed graph that describes how to break down the complicated function into elementary operations, and what are the intermediate values to be computed. The vertices in the computational graph are intermediate values, and the edges are elementary operations. An edge from $v_1$ to $v_2$ means to perform a certain elementary operation on intermediate value $v_1$ to get the next intermediate value $v_2$.
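
One possible way to hold such a graph in code is a mapping from each intermediate value to the elementary operation and the inputs that produce it. The sketch below does this for an illustrative function $f(x_1, x_2)=x_1x_2+\sin(x_1)$; the dictionary layout and the `evaluate` helper are assumptions made for this example and do not reflect how the package stores its graph.

```python
import math
import operator

# Each vertex maps to (elementary operation, names of its input vertices).
# Entries are listed so that inputs always appear before the vertices using them.
graph = {
    "v1": (operator.mul, ("x1", "x2")),   # v1 = x1 * x2
    "v2": (math.sin, ("x1",)),            # v2 = sin(x1)
    "v3": (operator.add, ("v1", "v2")),   # v3 = v1 + v2 = f(x1, x2)
}

def evaluate(graph, inputs):
    # Walk the vertices in insertion order and apply each elementary operation
    values = dict(inputs)
    for name, (op, args) in graph.items():
        values[name] = op(*(values[a] for a in args))
    return values

values = evaluate(graph, {"x1": 1.0, "x2": 2.0})
print(values["v3"])
```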
 
- **Trace**
 
    Traces simply mean the values we would like to keep track of in the forward pass in AD. For forward-mode AD, which is the backbone of this project, there are two traces, *Primal Trace* and *Tangent Trace*.
 
    **Primal trace** stores the elementary operation to get one intermediate value from previous results.
 
    For example $f(x)=e^{-\sin(x)}$, its primal trace is then
 
    $$v_0=x$$
    $$v_1=\sin(v_0)$$
    $$v_2=-v_1$$
    $$v_3=\exp(v_2)$$
 
    Primal trace provides the recipe for each intermediate value and eventually leads us to the final answer.
 
    **Tangent trace** stores the *directional derivative* of an intermediate value. Thanks to the chain rule, the tangent trace of $v_j$ can be written as the product $\frac{dv_j}{dv_i}D_pv_i$, where $v_i$ is the intermediate value from which $v_j$ is computed.
 
    Using the same example as before, the tangent trace of $f(x)$ is
 
    $$D_pv_0=1$$
    $$D_pv_1=\frac{dv_1}{dv_0}D_pv_0=\frac{d\sin(v_0)}{dv_0}D_pv_0=\cos(v_0)D_pv_0$$
    $$D_pv_2=\frac{d}{dv_1}(-v_1)D_pv_1=-D_pv_1$$
    $$D_pv_3=\frac{d}{dv_2}\exp(v_2)D_pv_2=\exp(v_2)D_pv_2$$
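
The two traces for $f(x)=e^{-\sin(x)}$ can be evaluated side by side. The following is a hand-written sketch that mirrors the equations above line by line; `forward_trace` is a hypothetical helper written for this example and is not a package function.

```python
import math

def forward_trace(x, seed=1.0):
    # Each step computes a primal value and its tangent (directional derivative)
    v0, dv0 = x, seed
    v1, dv1 = math.sin(v0), math.cos(v0) * dv0
    v2, dv2 = -v1, -dv1
    v3, dv3 = math.exp(v2), math.exp(v2) * dv2
    return [("v0", v0, dv0), ("v1", v1, dv1), ("v2", v2, dv2), ("v3", v3, dv3)]

trace = forward_trace(0.5)
for name, primal, tangent in trace:
    print(f"{name}: primal={primal:.6f}, tangent={tangent:.6f}")
```

The last row holds $f(0.5)$ and $f'(0.5)=-\cos(0.5)\,e^{-\sin(0.5)}$, obtained without ever writing down the derivative of the whole expression.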

- **Dual Number**

    Dual numbers are expressions of the form $x = a + b\varepsilon$, where $a, b \in \mathbb{R}$ and $\varepsilon$ is chosen such that $\varepsilon^2 = 0$ while $\varepsilon \neq 0$. Dual numbers have desirable properties that will later become useful for calculating derivatives.<br>
    Given any real polynomial $P(x) = p_0 + p_1x + p_2x^2 + \dots + p_nx^n$, let $x = a + b\varepsilon$, 
    $$P(a + b\varepsilon)= p_0 + p_1(a + b\varepsilon) + \cdots + p_n(a + b\varepsilon)^n$$
    Since $\varepsilon^2 = 0$, every term containing $\varepsilon^i$ with $i \geq 2$ vanishes, so
    $$P(a + b\varepsilon)= p_0 + p_1a + p_2a^2 + \cdots + p_na^n + p_1 b\varepsilon + 2 p_2 a b\varepsilon + \cdots + n p_n a^{n-1} b\varepsilon$$
    $$P(a + b\varepsilon)= P(a) + bP'(a)\varepsilon$$
    We can generalize the idea to other functions by using the Taylor series of $f(x)$ expanded around $c = a + 0\varepsilon$,
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(c)}{n !}(x-c)^n = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n !}(b\varepsilon)^n = f(a) + b\varepsilon f'(a)$$
    The first term is referred to as primal trace and the latter term is referred to as tangent trace, which are discussed above and crucial to our process of AD. 
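
The identity $P(a + b\varepsilon)= P(a) + bP'(a)\varepsilon$ can be checked with a few lines of code. Below is a minimal, standalone sketch of a dual number type supporting only `+` and `*` plus a `sin` helper. It borrows the attribute names `real` and `dual`, but it is an illustration written for this section and not the package's `dual.py`.

```python
import math

class Dual:
    """Minimal dual number a + b*eps with eps**2 == 0 (illustrative only)."""

    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps**2 = 0
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__

def sin(z):
    # f(a + b eps) = f(a) + b f'(a) eps with f = sin
    return Dual(math.sin(z.real), z.dual * math.cos(z.real))

# P(x) = 1 + 2x + 3x^2, so P(2) = 17 and P'(2) = 2 + 6 * 2 = 14
x = Dual(2.0, 1.0)            # a = 2, b = 1
p = 1 + 2 * x + 3 * x * x
print(p.real, p.dual)         # prints 17.0 14.0

s = sin(x * x)                # d/dx sin(x^2) = 2x cos(x^2)
print(s.real, s.dual)
```

The `real` part carries the primal trace and the `dual` part carries the tangent trace, so ordinary arithmetic on `Dual` objects performs forward-mode AD as a side effect.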



## How to use *AutoDiff*

- Install the package from PyPI. We recommend Python 3.10 or newer.
    
    ```bash
    $ pip install -i https://test.pypi.org/simple/ auto-diff-2022-fall-team04==0.0.2
    ```

- You may also install the package directly from our GitHub repo. 

    1. Clone the folder from GitHub
    2. Change the directory into the cloned folder and run the following:
    
        ```bash
        $ pip install .
        ```

- `Expression` class

    * Initialization

        Note that the `Expression` class serves as a base class for the `Variable` and `Function` classes. Initializing an `Expression` object directly is not supported, so users should not create instances of the `Expression` class.
    
    * Operators

        The classes inherited from the Expression class will be able to use the following operators.

        1. Arithmetic Operators
        
            Users will be able to perform basic arithmetic operators of `+`, `-`, `*`, `/`, `**` on two Expression objects. 
        
        2. Custom Operators
            - `exp()`
            - `sin()`
            - `cos()`
            - `tan()`
            - `log()`
            - `sinh()`
            - `cosh()`
            - `tanh()` 
            - `arcsin()` 
            - `arccos()`
            - `arctan()`
            - `log_base()`
            - `sigmond()`
            - `sqrt()`
    
    * API
        * Exponential

            ```python
            from auto_diff_2022_fall_team04 import Expression
            Expression.exp(x: Expression) -> Function 
            ```

        * Sin
            ```python
            Expression.sin(x: Expression) -> Function
            ```

        * Cos
            ```python
            Expression.cos(x: Expression) -> Function
            ```

        * Tan
            ```python
            Expression.tan(x: Expression) -> Function
            ```

        * `__call__`

            Evaluate the function and its derivative at the given `input` with the given `seed` dictionary. `input` contains key-value pairs of each variable's name and its corresponding value. `seed` contains key-value pairs of each variable's name and its corresponding seed. If a seed is not provided, this operation will calculate $\nabla f$, which could be computationally intensive when the input dimension is large. To avoid slow execution, you can specify a desired `seed`.
            
            ```python
            Expression.__call__(dict(Variable: Tuple)) -> Union[Tuple, Dict] 
    
            ```

- `Variable` class

    * Declare variables in forward mode
    ```python
    from auto_diff_2022_fall_team04 import Variable
    x, y = Variable('x'), Variable('y')
    ```

    * Declare variables in reverse mode
    ```python
    from auto_diff_2022_fall_team04 import Variable
    x, y = Variable('x', mode = 'r'), Variable('y', mode = 'r')
    ```
    
- `Function` class
        
    We can create a Function object using the operators specified in the Expression's Operator section.

    Example:

    ```python
    from auto_diff_2022_fall_team04 import Variable
    x, y = Variable('x'), Variable('y')
    f = x + y #create a function f = x + y 
    f_val, f_deriv = f({'x':1, 'y':2}) #return the value of f and the partial derivative at x = 1, y = 2 since the seed is not specified. 

    ```

    Here are some more complicated examples:

    ```python
    from auto_diff_2022_fall_team04 import Variable, Expression
    x, y = Variable('x'), Variable('y')
    f = Expression.sin(x * 4) + Expression.cos(y * 4)        # create a function f = sin(4x) + cos(4y)
    f = f + Expression.exp(x * y)             # f = sin(4x) + cos(4y) + e^(xy)
    f_val, f_deriv = f({'x': [1,4,5], 'y': [2,7,8]}, seed={'x':[1,1,1], 'y':[0,0,0]})   # return the values of f and the derivatives in the direction of the seed (x: 1, y: 0) at the three input points
    ```

    ```python
    from auto_diff_2022_fall_team04 import Variable, Expression
    x, y = Variable('x', mode = 'r'), Variable('y', mode = 'r')
    f = Expression.sin(x * 4) + Expression.cosh(y * 4) + Expression.log_base(x * y,3)       # create a function f = sin(4x) + cosh(4y) + log_base_3(x*y)
    f_val, f_deriv = f({'x': [1, 2, 3], 'y': [2,5,8]})   # return the value of f and the derivative in reverse mode. 
    ```
    



## Software Organization

***Directory Structure:***
    
```  
    team04/
    ├── docs/
    │   ├── documentation.ipynb
    │   ├── milestone1.ipynb
    │   ├── milestone2.ipynb
    │   ├── milestone2_progress.ipynb
    │   ├── dual.md
    │   └── expression.md
    ├── src/
    │   └── auto_diff
    │   │    ├── dual/
    │   │    │    ├── __init__.py
    │   │    │    └── dual.py
    │   │    ├──  expression/
    │   │    │    ├── __init__.py
    │   │    │    ├── expression.py
    │   │    │    ├── node.py
    │   │    │    └── ops.py
    │   │    └── __init__.py
    ├── tests/
    │   ├── dual/
    │   │   └── dual_test.py
    │   ├── expression/
    │   │   ├── expression_test.py
    │   │   ├── function_test.py
    │   │   ├── ops_test.py
    │   │   └── variable_test.py
    │   ├── optimization/
    │   │   └── optimization_demo.py
    │   ├── check_coverage.sh
    │   └── run_test.sh
    ├── LICENSE
    ├── README.md
    ├── requirement.txt
    └── .gitignore
```

- ***team04/***

    This is the project's **root folder**, which contains the README file, license, .gitignore, requirements file, and the sub-directories for source code, tests, and documentation.

- ***team04/docs/***  

    The **team04/docs/** directory contains the **documentation** that explains the usage of the classes and functions defined in this project. In addition, our milestone progress is also stored in this folder.

- ***team04/src/***  

    The **team04/src/** directory contains all the **source code**. Our current plan is to include two sub-packages, dual and expression. Dual provides a DualNumber class, which would carry out the actual computation in expression evaluation. Expression provides support for function declaration and evaluation. Please see the **Implementation** section below for more details. 

- ***team04/tests/***  

    The **team04/tests/** directory contains the **unit and integration tests** of this project, ensuring the project's proper functioning before release. We followed a Continuous Integration workflow to automate testing. Additionally, we used pytest to create concrete test suites and pytest-cov to generate code coverage reports.

***Distribution***:

We plan to distribute AutoDiff on PyPI following PEP 518 in the future. Currently, users can install the package by following the instructions in the **How to use *AutoDiff*** section.

## Implementation

### Core Classes
#### `dual.py`
This module implements dual numbers and provides mathematical operations on them. Dual numbers are the underlying data structure for forward mode AD.
* Attributes :<br>
`real` and `dual` 
* Methods:
In the file, we overloaded some Dunder Methods in Python, including the following: <br>
`__add__` and `__radd__` <br>
`__sub__` and `__rsub__`<br>
`__mul__` and `__rmul__`<br>
`__truediv__` and `__rdiv__` <br>
`__pow__` <br>
`__neg__` <br>
`__len__` <br>
`__iter__`<br>
`__str__` <br>
`__eq__` <br>
Also, we implemented the following static methods: <br>
`exp` <br>
`log` <br>
`sin` <br>
`cos` <br>
`tan` <br>
`sinh` <br>
`cosh` <br>
`tanh` <br>
`arcsin` <br>
`arccos` <br>
`arctan` <br>
`log_base` <br>
`sigmond` <br>
`sqrt` <br>

#### `expression.py`
This is the most important module that users will interact with. It contains three classes, `Expression`, `Variable`, and `Function`. 


##### `Expression`
  This is an abstract base class, and is not intended to be initialized directly. Its child classes, `Variable` and `Function`, carry out the real work. We included the same methods implemented in the dual class so that users can call them directly.
 * Additional method: <br>
`__call__` : This method accepts two `dict` inputs. The first specifies the point at which to evaluate the function, and the second specifies the seed vector. It returns two outputs: the value of the object at the given point, and its derivative. If no seed is given, it returns the partial derivatives.

##### `Variable`
Users are expected to declare the variables in their function via this class. 
* Attributes :<br>
`name` - a string that gives a variable its name, such as `x` or `y`. <br>
`mode` - `f`/`r`, a flag to indicate forward or reverse mode. `f` is the default. <br>
* Declare variables: 

  ```python
  from auto_diff_2022_fall_team04 import Variable
  x, y = Variable('x'), Variable('y')
  ```
  Note that the uniqueness of variable names is strongly recommended, but not required. The program would still work if multiple variables are given the same name, in which case all of these variables may take the same value at evaluation, resulting in unexpected behavior.

##### `Function`
A `Function` performs some elementary operation on other `Expression`(s). Constructing a `Function` object should be intuitive: combine `Variable` or other `Function` objects, using either the functions provided in the `auto_diff` package (see the list of APIs above) or arithmetic operators.
* Attributes :<br> 
`varname` - a set of variables <br> 
`mode` - `f`/`r`, a flag to indicate forward or reverse mode. `f` is the default.
* Methods :<br> 
`forward` - recursively attempt to find the derivative in forward mode. 

* Create a function: 

  ```python
  from auto_diff_2022_fall_team04 import Variable
  x, y = Variable('x'), Variable('y')
  f = x ** y  # create a function f = x ^ y
  f_val, f_deriv = f({'x': 1, 'y': 2})  # return the value of f and its partial derivatives at x = 1, y = 2, since no seed is specified
  ```

#### `ops.py`
This module provides mathematical operations for function evaluation, including elementary operations and trigonometric functions. The functions in this module are designed to be used only internally by `expression.py`.

### External Dependencies
We used NumPy to help with math calculations. We also used pytest for testing and pytest-cov to generate coverage reports. We plan to upload and distribute the package via PyPI in the final milestone.

## Extension

### Reverse Mode

Our team chose to implement reverse-mode automatic differentiation as the extension of this project. To realize reverse mode, we introduced a new `Node` class and added two functions to the `Expression` class.

#### `node.py`

This module defines the `Node` class, which represents a single node in the computational graph. In reverse mode, every object of the `Expression` class is paired with one `Node` object, which stores the parent and child nodes of the current node. Additionally, the `Node` carries out calculations such as computing the partial derivatives with respect to its parents in the forward pass and aggregating the adjoints of its children to compute its own adjoint.

* Attributes :<br>
    - `parent`: stores its parent nodes

    - `partial_func`: a list of function objects, each storing the *analytical expression* of the partial derivative of the current node with respect to one of its parents. These functions are determined at initialization

    - `partial_val`: a list of real numbers, each representing the *actual value* of the partial derivative of the current node with respect to one of its parents. These values are determined during the forward pass, using the functions in `partial_func`

    - `child`: stores the list of children nodes

    - `received`: this is initialized as an empty set. It keeps track of which child had already computed adjoint. 

    - `adjoint`: the adjoint value of the current node, equivalent to $\partial f/ \partial v_j$ for the node $v_j$


* Methods:

    - `update`: takes the value of the current node as input, and then calculates the actual values of the partial derivatives of the node with respect to its parents. These values are stored in the `partial_val` attribute
    
    - `notify`: a node is notified by each of its children once that child has finished computing its adjoint. Inside the function, the parent node adds the child that made the call to its `received` set, and then increments its adjoint by the adjoint of its child node
    
    - `compute`: in the reverse pass, each `Function` or `Variable` object calls the `compute` function of its node to compute the adjoint of the current value. When called, this function first checks whether all of its children have finished computing their adjoints. If so, the function aggregates the results to get the adjoint of the current node and returns it. If some of its children haven't calculated their adjoints yet, it keeps waiting and returns `None`. Once its adjoint is computed, this function also notifies its parents with the adjoint value
    
    
#### `expression.py`

Two functions, `propagate` and `backward`, are added to `Expression` class to support reverse mode, where `propagate` is used for the forward pass, and `backward` is for the reverse pass. 

- `propagate`

    As in forward mode, the autodiff package first evaluates the values of all functions and variables when an expression is called in reverse mode. This is done by calling the `propagate` function that each `Function` or `Variable` object implements. This process is the forward pass of reverse mode.

    When called, `propagate` first evaluates the value of the current expression. This is done recursively, similar to the `forward` function in forward mode. Once evaluated, `propagate` supplies its value to its node via the `update` function. This value is used to compute the actual value of the partial derivative with respect to its parent in the computational graph.

- `backward`

    As implied by its name, this function is used for the backward pass in reverse mode. When called, this function asks its node to compute the adjoint by calling the node's `compute` function. If the returned value is not `None`, meaning that its adjoint is already known, it takes one step further back and calls the `backward` function of its parents. Otherwise, it does nothing and waits until it is called by another child node.
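
The interplay between `notify`, `compute`, and `backward` described above can be sketched in a few dozen lines. The code below is a simplified stand-in written for this section, not the package's implementation: partial derivative values are filled in when a node is created (standing in for `propagate` and `update`), the `done` flag and the helper functions `variable`, `mul`, `add`, and `sin` are assumptions, and each child is assumed to pass its adjoint multiplied by the matching partial derivative when it notifies a parent.

```python
import math

class Node:
    """One vertex of the computational graph (hypothetical stand-in for node.py)."""

    def __init__(self, value, parent=(), partial_val=()):
        self.value = value
        self.parent = list(parent)            # nodes this value was computed from
        self.partial_val = list(partial_val)  # d(self)/d(parent), one per parent
        self.child = []                       # nodes computed from this one
        self.received = set()                 # children that already reported
        self.adjoint = 0.0
        self.done = False                     # guards against propagating twice
        for p in self.parent:
            if self not in p.child:
                p.child.append(self)

    def notify(self, child, contribution):
        # Accumulate adjoint(child) * d(child)/d(self)
        self.received.add(child)
        self.adjoint += contribution

    def compute(self):
        # Return None while waiting for children, or if already propagated
        if self.done or len(self.received) < len(self.child):
            return None
        self.done = True
        for p, d in zip(self.parent, self.partial_val):
            p.notify(self, self.adjoint * d)
        return self.adjoint

def backward(node):
    # Step further back only once this node's adjoint is known
    if node.compute() is None:
        return
    for p in node.parent:
        backward(p)

# Forward pass helpers: compute the value and the partial derivative values
def variable(value):
    return Node(value)

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def sin(a):
    return Node(math.sin(a.value), (a,), (math.cos(a.value),))

# f(x, y) = sin(x) * y + x at x = 1, y = 2
x, y = variable(1.0), variable(2.0)
f = add(mul(sin(x), y), x)
f.adjoint = 1.0          # seed the output: df/df = 1
backward(f)
print(f.value, x.adjoint, y.adjoint)
```

After the reverse pass, `x.adjoint` holds $\partial f/\partial x=y\cos(x)+1$ and `y.adjoint` holds $\partial f/\partial y=\sin(x)$. Note that `x` has two children here, so it stays in the waiting state until both have reported.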


## Licensing

We chose the **MIT License** since we would like to permit unrestricted use and distribution of our program, so the whole community can benefit from it without any legal obstructions. As a permissive license, it is broadly compatible with other open-source licenses as well as closed-source, proprietary products. The MIT License is short and easy for people to understand, and it fits our needs well.

## Broader Impact

Our package was designed to compute the first-order derivative of any given function using both forward and reverse mode automatic differentiation. Compared to symbolic differentiation and numerical differentiation, automatic differentiation provides a less costly and more efficient way of calculating derivatives while maintaining machine precision. Automatic differentiation is widely used in machine learning, data science, audio signal processing and many other fields. We believe automatic differentiation would be helpful for these large projects, which require a large amount of computing power.
Our project can also serve an educational purpose. For example, when students are learning how to calculate derivatives by hand, our package could help the instructor quickly derive the desired derivative, and help students check their attempted solutions quickly. A potential downside is that students might skip understanding the concept of the derivative and learning how to solve limits, and simply copy the returned results into their homework.


## Software Inclusivity

AutoDiff was designed to welcome users and contributors from all backgrounds. The package was developed based on the key principles of the Python community: mutual respect, tolerance and encouragement. Throughout development, we made every effort to create an inclusive and user-friendly package, including but not limited to asking our TF for feedback, writing docstrings, and providing sufficient documentation. Every team member contributed by reviewing pull requests, providing feedback to each other, and approving pull requests. Admittedly, the package was written in English and Python, but there will be opportunities to localize the package in the future. We used the MIT license and planned to release the package to the open source community so anyone who has experience in another language will have the opportunity to rewrite and translate the package.

## Future Features



* We will extend our package to solve optimization problems, so that users can apply the gradient descent or ascent algorithm to find local minima or maxima. For a given function, the user starts with an initial guess, and the program repeatedly steps along the negative gradient (for a minimum) or the gradient (for a maximum) until the gradient is close to 0. <br />
**Implementation**: We will write a new class called Extrema_Solver(). It will take an objective function, an initial guess of position, a min/max choice, and a threshold for the gradient. It will print the lists of local minima and maxima and return the largest or smallest value from the list.
* We would like to generate the computation graphs and trace tables, which will present these useful values in a more straightforward way. Computation graphs are widely used in machine learning and audio signal processing. We believe that the implementation of computation graphs will encourage more people working in these fields to use our package. <br />
**Implementation**: We will write a new class called ComputeGraph(), which will build the computation graph as the computation of automatic differentiation proceeds. 
* We are considering implementing a graphical user interface. The user can just enter the function with x-values and an optional seed vector, and then click calculate. This would let people who don’t have any coding experience use our package easily.<br />
**Implementation**: We will write a new class called UserGUI(), which will make use of PySimpleGUI to create a user-friendly graphical user interface. 
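
For the planned optimization feature in the first item above, the core loop could look roughly like the sketch below. It is a plain gradient descent with a fixed step size and a gradient-norm threshold. The function `gradient_descent`, its parameters, and the hand-written gradient are all hypothetical; in the package, the gradient would presumably come from evaluating an AutoDiff expression instead.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    # Step against the gradient until its norm drops below the threshold
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(g_i * g_i for g_i in g) ** 0.5 < tol:
            break
        x = [x_i - step * g_i for x_i, g_i in zip(x, g)]
    return x

# Illustrative objective f(x, y) = (x - 1)^2 + 2 (y + 3)^2 with minimum at (1, -3)
grad = lambda v: [2.0 * (v[0] - 1.0), 4.0 * (v[1] + 3.0)]
x_min = gradient_descent(grad, [0.0, 0.0])
print(x_min)
```

Flipping the sign of the update turns this into gradient ascent for finding a local maximum. A fixed step size is the simplest choice and may need tuning, or a line search, for less well-behaved functions.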