<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS107/AC207 Team04 Final Project </h1>

# Automatic Differentiation Package

**Harvard University**<br/>
**Fall 2022**<br/>
**Team members**: Yixian Gan, Siyao Li, Ting Chen, Li Yao, Haitian Liu



<hr style='height:2px'>

## Introduction
The **Automatic Differentiation (AD)** project is a Python package that implements forward-mode automatic differentiation for user-defined input functions.

In scientific research and engineering projects, we often need to compute the derivative of a function (for example, the $f'(x)$ term in Newton's method). For simple input functions, we can derive an exact analytical expression with ease. However, once the function becomes complicated, deriving an analytical expression by hand may be hard or even impractical. This problem becomes especially acute in deep learning, where we are interested in the derivative of model losses with respect to input features, both of which can be vectors with hundreds of dimensions.

An alternative is to compute the derivative with automatic differentiation. AD breaks a large, complex input function down into a composition of elementary operations whose derivatives are trivial to compute. By tracking the derivatives of intermediate results and repeatedly applying the chain rule, AD can compute the derivative of the input function along a chosen direction, exact up to floating-point round-off. This matters because many machine learning methods rely on gradient descent, and gradient descent requires computing the gradient.



## Background

*This section gives a brief overview of how AD works. Users not interested in the math may skip to the* **How to use ADPackage** *section below.*

- **Elementary Operation**

  The key concept of AD is to break a complicated function down into small, manageable steps and to solve each step individually. Typically, each step in AD performs only one elementary operation. Here, "elementary operations" refers to both arithmetic operations (`+`, `-`, `*`, scalar division, power, etc.) and elementary functions (`exp`, `log`, `sin`, `cos`, etc.). Each elementary operation takes only one or two inputs, and its partial derivative with respect to each input is easy to compute. We later chain these intermediate derivatives to get the overall result.
  
- **Chain Rule**

  The chain rule in calculus is the rule for computing the derivative of composite functions. It allows us to write the derivative of a composite function as the product of derivatives of simpler functions. The simplest case is the derivative of a scalar function of a single scalar variable:
  
  $$\frac{d}{dx}f(u(x))=\frac{df(u)}{du}\frac{du(x)}{dx}$$
  
  A more general case is a function $f$ of an $n$-dimensional vector variable $\textbf{x}=(x_1, x_2, \dots, x_n)$. Then, instead of a derivative, we would like to compute the gradient of $f$ with respect to $\textbf{x}$. Suppose $f$ is a function of a vector $\textbf{y}$, which is itself a function of the vector $\textbf{x}$. The chain rule for multivariate functions is given by
  
  $$\nabla_xf=\frac{\partial f}{\partial y_1}\nabla_xy_1+\frac{\partial f}{\partial y_2}\nabla_xy_2+... $$
  $$=\sum_i \frac{\partial f}{\partial y_i}\nabla_xy_i$$
  
  The chain rule is exceptionally useful in AD: we can take the $y_i$ to be the intermediate results at each step, so that, by the chain rule, the gradient of the function of interest is assembled from products of the derivatives calculated in each small step.
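
  As a small worked illustration (the functions here are chosen only for this example), take $f(\textbf{y})=y_1y_2$ with $y_1=x_1+x_2$ and $y_2=\sin(x_1)$. Applying the multivariate chain rule above gives

  $$\nabla_xf=y_2\nabla_xy_1+y_1\nabla_xy_2=\sin(x_1)\begin{pmatrix}1\\1\end{pmatrix}+(x_1+x_2)\begin{pmatrix}\cos(x_1)\\0\end{pmatrix}$$

  which agrees with differentiating $f=(x_1+x_2)\sin(x_1)$ directly: $\partial f/\partial x_1=\sin(x_1)+(x_1+x_2)\cos(x_1)$ and $\partial f/\partial x_2=\sin(x_1)$.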

- **Directional Derivative** $D_p$
 
    An intuitive way to think of the gradient is as the direction in $n$-dimensional space in which the function $f(x_1, x_2, \dots, x_n)$ increases the fastest. For a function of an $n$-dimensional variable $\textbf{x}$, the gradient is also an $n$-dimensional vector. Therefore, storing the gradient of every intermediate result in AD can be computationally costly (there might be millions of intermediate results in a complicated computation). A remedy is to store directional derivatives instead. The intuition behind the directional derivative is that, instead of the direction of fastest ascent, we calculate the rate of ascent along a particular direction of interest. Mathematically, the directional derivative of $f(\textbf{x})$ in direction $\textbf{p}$ is defined as the dot product of the gradient of $f$ with $\textbf{p}$, which is the *projection* of the gradient onto $\textbf{p}$ when $\textbf{p}$ is a unit vector.
 
    $$D_{\textbf{p}}f=\nabla_xf\cdot \textbf{p}$$
 
    Therefore, instead of the gradient of each intermediate result, we store only the directional derivative of each intermediate result. These directional derivatives are dot products of vectors, so they are all scalars, which are much cheaper to store.
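
    As a small worked illustration (the function and directions are chosen only for this example), take $f(x_1,x_2)=x_1^2x_2$, whose gradient is $\nabla_xf=(2x_1x_2,\ x_1^2)$. At the point $(1,2)$ the gradient is $(4,1)$, so

    $$D_{\textbf{p}}f=(4,1)\cdot(1,0)=4 \quad\text{for } \textbf{p}=(1,0)$$
    $$D_{\textbf{p}}f=(4,1)\cdot\left(\tfrac{3}{5},\tfrac{4}{5}\right)=\tfrac{16}{5} \quad\text{for } \textbf{p}=\left(\tfrac{3}{5},\tfrac{4}{5}\right)$$

    Choosing $\textbf{p}$ as the $i$-th standard basis vector recovers the partial derivative $\partial f/\partial x_i$, so the full gradient can still be assembled from $n$ directional derivatives when it is needed.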
 
- **Computational Graph**
 
    A computational graph is a directed graph that describes how to break the complicated function down into elementary operations and which intermediate values are to be computed. The vertices of the computational graph are intermediate values and the edges are elementary operations. An edge from $v_1$ to $v_2$ means that an elementary operation is applied to the intermediate value $v_1$ to get the next intermediate value $v_2$.
 
- **Trace**
 
    Traces simply mean the values we would like to keep track of in the forward pass in AD. For forward-mode AD, which is the backbone of this project, there are two traces, *Primal Trace* and *Tangent Trace*.
 
    **Primal trace** stores the elementary operation to get one intermediate value from previous results.
 
    For example, for $f(x)=e^{-\sin(x)}$, the primal trace is
 
    $$v_0=x$$
    $$v_1=\sin(v_0)$$
    $$v_2=-v_1$$
    $$v_3=\exp(v_2)$$
 
    Primal trace provides the recipe for each intermediate value and eventually leads us to the final answer.
 
    **Tangent trace** stores the *directional derivative* of an intermediate value. Thanks to the chain rule, the tangent trace of $v_j$ can be written as the product $\frac{dv_j}{dv_i}D_pv_i$, where $v_i$ is the intermediate value from which $v_j$ is computed (with one such term per input, summed, when $v_j$ depends on several intermediate values).
 
    Using the same example as before, the tangent trace of $f(x)$ is
 
    $$D_pv_0=1$$
    $$D_pv_1=\frac{dv_1}{dv_0}D_pv_0=\frac{d\sin(v_0)}{dv_0}D_pv_0=\cos(v_0)D_pv_0$$
    $$D_pv_2=\frac{d}{dv_1}(-v_1)D_pv_1=-D_pv_1$$
    $$D_pv_3=\frac{d}{dv_2}\exp(v_2)D_pv_2=\exp(v_2)D_pv_2$$
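
The primal and tangent traces above can be sketched in a few lines of Python. This is only an illustrative sketch, not the ADPackage implementation: the `Dual` class and the `sin`, `exp`, and `f` helpers are hypothetical names introduced here, and only the three operations needed for $f(x)=e^{-\sin(x)}$ are covered.

```python
import math


class Dual:
    """Hypothetical helper (not part of ADPackage): a primal value and its tangent."""

    def __init__(self, value, tangent=0.0):
        self.value = value      # primal trace entry v_j
        self.tangent = tangent  # tangent trace entry D_p v_j

    def __neg__(self):
        # v = -u  =>  D_p v = -D_p u
        return Dual(-self.value, -self.tangent)


def sin(u):
    # v = sin(u)  =>  D_p v = cos(u) * D_p u
    return Dual(math.sin(u.value), math.cos(u.value) * u.tangent)


def exp(u):
    # v = exp(u)  =>  D_p v = exp(u) * D_p u
    return Dual(math.exp(u.value), math.exp(u.value) * u.tangent)


def f(v0):
    # Each line is one elementary operation, mirroring the traces above.
    v1 = sin(v0)
    v2 = -v1
    v3 = exp(v2)
    return v3


x0 = 0.5                # evaluation point, chosen arbitrarily for illustration
v0 = Dual(x0, 1.0)      # seed the tangent trace with D_p v_0 = 1
result = f(v0)

# Analytical derivative of exp(-sin(x)) for comparison.
analytic = -math.cos(x0) * math.exp(-math.sin(x0))
print(result.value, result.tangent, analytic)
```

Each intermediate `Dual` carries one entry of the primal trace and the matching entry of the tangent trace, so `result.tangent` should agree with the analytical derivative up to floating-point round-off.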


## How to use *ADPackage*


1. Install Package from PyPI
    - MacOS / Linux
    
        ```bash
        $ python -m pip install ADPackage
        ```

    - Windows

        ```bash
        $ py -m pip install ADPackage
        ```

2. Import the package

    ```python
    import ADPackage as AD
    ```

3. Utilize the classes and methods defined in the package

    ```python
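    # object_type and function_name are placeholders for the
    # classes and functions that the package provides
    obj = AD.object_type()
    result = AD.function_name()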
    ```

## Software Organization

Our directory structure will look like:

```
team04/
├── docs/
│   ├── milestone1.ipynb
│   ├── placeholder
│   └── placeholder
├── lib/
│   └── lib_A/
├── src/
│   ├── __init__.py
│   ├── forward/
│   │   ├── __init__.py
│   │   ├── placeholder
│   │   └── placeholder
│   └── reverse/
│       ├── __init__.py
│       ├── placeholder
│       └── placeholder
├── tests/
│   └── placeholder/
├── LICENSE
├── README.md
└── .gitignore
```


- ***team04/***

    The *team04/* directory is the project root, named after the team, where all project files are located. It contains the source code, third-party libraries, documentation, and tests, along with top-level files such as *LICENSE* and *README.md*.

- ***team04/docs/***  

    The *team04/docs/* directory holds the project documentation, including the milestone notebook *milestone1.ipynb*.

- ***team04/lib/***  

    The *team04/lib/* directory contains all the **third-party libraries** that the project needs. Libraries placed here usually follow a directory structure similar to the one used for this project. The folder tree would look like

    ```
    team04/
    ├── lib/
    │   ├── lib_A/
    │   └── lib_B/
    ```

- ***team04/src/***  

    The *team04/src/* directory contains all the **source code** of the package. All the code that the project consists of must go in here. Other directories hold the components needed to document and test the program, but the *src/* directory holds the program itself. The folder has subdirectories to separate components, such as *forward/* and *reverse/*.

    A *team04/src/utils/* directory would contain small helper functions needed throughout the source code, which serve as building blocks for larger and more complicated code. The folder tree would look like

    ```
    team04/
    ├── src/
    │   ├── __init__.py
    │   ├── forward/
    │   ├── reverse/
    │   └── utils/
    ```

- ***team04/tests/***  

    As the name suggests, the *team04/tests/* directory holds the code for unit testing. Pre-release versions of the program, such as alpha or beta developer versions, are stored and tested here. Files in this directory should be pre-release, unfinished code; once a code revision is finished, it can be moved to a release directory with a version number. The folder tree would look like

    ```
    team04/
    ├── tests/
    │   ├── alpha_version/
    │   └── beta_version/
    ```

## Implementation


## Licensing

We chose the **MIT License** because we would like to permit unrestricted use and distribution of our program. It is a permissive license that is compatible with most other open-source licenses and can also be used in closed-source, proprietary products. The MIT License is short and easy to understand, and it fits our needs well.