# Progress Update

# So far

- Steep learning curve
- Issue #42 -> on hold until LLVM update
- daphne-opt (run passes on .mlir files)
- DenseMatrix MLIR interop
- Sum Reduce on DenseMatrix
- Sum Reduce on DenseMatrix with value cast
- Benchmarking
- LLVM/MLIR community (Discourse, Discord, Meetings, Conf, Papers)

# DenseMatrix Interop

### DenseMatrix -> StridedMemRefType

```c++
StridedMemRefType<double, 2> getMemRefDenseMatrix(
    const DenseMatrix<double> *input, DCTX(ctx))

%9 = "daphne.call_kernel"(%7, %5) {callee = "_getMemRefDenseMatrix__StridedMemRefType___DenseMatrix_double"} : (!daphne.Matrix<10x10xf64>, !daphne.DaphneContext) -> memref<10x10xf64>

```


### StridedMemRefType -> DenseMatrix 
```c++
DenseMatrix<double> *getDenseMatrixFromMemRef(
    const StridedMemRefType<double, 2> *memRef, DCTX(ctx))

%11 = "daphne.call_kernel"(%10, %5) {callee = "_getDenseMatrixFromMemRef__DenseMatrix_double__StridedMemRefType_"} : (memref<10x10xf64>, !daphne.DaphneContext) -> !daphne.Matrix<10x10xf64>
```
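To make the two conversions concrete, here is a minimal sketch of how a row-major matrix buffer can be viewed through a rank-2 strided descriptor without copying. `MemRefDescriptor` mirrors the field layout of MLIR's `StridedMemRefType` from `CRunnerUtils.h`. `ToyDenseMatrix`, `toMemRef`, and `at` are hypothetical stand-ins for illustration, not the DAPHNE classes or kernels.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in mirroring the field layout of MLIR's StridedMemRefType
// (mlir/ExecutionEngine/CRunnerUtils.h). Renamed to mark it as a sketch.
template <typename T, int N>
struct MemRefDescriptor {
    T *basePtr;          // allocated pointer
    T *data;             // aligned pointer used for element access
    int64_t offset;      // element offset from data
    int64_t sizes[N];    // extent of each dimension
    int64_t strides[N];  // element stride of each dimension
};

// Hypothetical minimal row-major matrix, NOT DAPHNE's DenseMatrix.
struct ToyDenseMatrix {
    int64_t numRows;
    int64_t numCols;
    int64_t rowSkip;  // elements between consecutive row starts
    std::vector<double> values;
};

// Matrix -> descriptor: wrap the existing buffer, no copy.
MemRefDescriptor<double, 2> toMemRef(ToyDenseMatrix &m) {
    assert(m.rowSkip >= m.numCols);
    return {m.values.data(), m.values.data(), 0,
            {m.numRows, m.numCols}, {m.rowSkip, 1}};
}

// Element access through the descriptor: offset + r * stride0 + c * stride1.
double &at(const MemRefDescriptor<double, 2> &d, int64_t r, int64_t c) {
    assert(r >= 0 && r < d.sizes[0] && c >= 0 && c < d.sizes[1]);
    return d.data[d.offset + r * d.strides[0] + c * d.strides[1]];
}
```

Because the descriptor only borrows the buffer, a write through `at(...)` is visible in the original matrix. The reverse direction would read `sizes` and `data` back out of the descriptor in the same way.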


# SumAllOp

From

```c++
%10 = "daphne.sumAll"(%8) : (!daphne.Matrix<50000x25000xf64>) -> f64
```

To

```c++
%11 = affine.for %arg0 = 0 to 50000 iter_args(%arg1 = %cst) -> (f64) {
  %cst_0 = constant 0.000000e+00 : f64
  %15 = affine.for %arg2 = 0 to 25000 iter_args(%arg3 = %cst_0) -> (f64) {
    %17 = memref.load %10[%arg0, %arg2] : memref<50000x25000xf64>
    %19 = addf %arg3, %17 : f64
    affine.yield %19 : f64
  }
  %16 = addf %arg1, %15 : f64
  affine.yield %16 : f64
}
```
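For readers less familiar with `iter_args`, the lowered loop nest corresponds roughly to the C++ below. This is a sketch only: `sumAll` here is a hypothetical free function, the buffer is assumed to be contiguous and row-major, and `%cst` is assumed to be `0.0`.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Plain C++ counterpart of the two nested affine.for loops.
// Each loop carries one accumulator, like the iter_args values.
double sumAll(const double *data, int64_t rows, int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    double total = 0.0;                    // %arg1 (assumed to start at 0.0)
    for (int64_t r = 0; r < rows; ++r) {
        double rowSum = 0.0;               // %arg3, initialised from %cst_0
        for (int64_t c = 0; c < cols; ++c)
            rowSum += data[r * cols + c];  // memref.load followed by addf
        total += rowSum;                   // outer addf, then affine.yield
    }
    return total;
}
```

Calling `sumAll(v.data(), 2, 3)` on a `std::vector<double>` holding six values sums each row first and then adds the row sums, which is the same association order as the MLIR version.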


# SumAllOp with f32 -> f64 cast

From

```c++
%10 = "daphne.call_kernel"(%8, %7) {callee = "_cast__DenseMatrix_double__DenseMatrix_float"} : (!daphne.Matrix<50000x25000xf32>, !daphne.DaphneContext) -> !daphne.Matrix<50000x25000xf64>

%11 = "daphne.call_kernel"(%10, %7) {callee = "_sumAll__double__DenseMatrix_double"} : (!daphne.Matrix<50000x25000xf64>, !daphne.DaphneContext) -> f64
```

To

```c++
%11 = affine.for %arg0 = 0 to 50000 iter_args(%arg1 = %cst) -> (f64) {
  %cst_0 = constant 0.000000e+00 : f64
  %15 = affine.for %arg2 = 0 to 25000 iter_args(%arg3 = %cst_0) -> (f64) {
    %17 = memref.load %10[%arg0, %arg2] : memref<50000x25000xf32>
    %18 = fpext %17 : f32 to f64
    %19 = addf %arg3, %18 : f64
    affine.yield %19 : f64
  }
  %16 = addf %arg1, %15 : f64
  affine.yield %16 : f64
}
```
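The difference between the two variants can be sketched in plain C++ as follows. `castThenSum` models the two kernel calls (cast, then sum) and allocates a full-size f64 temporary, while `sumAllWidened` models the generated loop that widens each element on load. Both function names are hypothetical, and a contiguous row-major buffer is assumed.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Two-step variant: materialise an f64 copy of the matrix, then sum it.
double castThenSum(const float *data, int64_t rows, int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    std::vector<double> widened(data, data + rows * cols);  // full-size temporary
    double total = 0.0;
    for (int64_t r = 0; r < rows; ++r) {
        double rowSum = 0.0;
        for (int64_t c = 0; c < cols; ++c)
            rowSum += widened[r * cols + c];
        total += rowSum;
    }
    return total;
}

// Fused variant: widen each element right after loading it (the fpext step),
// so no intermediate f64 matrix is allocated.
double sumAllWidened(const float *data, int64_t rows, int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    double total = 0.0;
    for (int64_t r = 0; r < rows; ++r) {
        double rowSum = 0.0;
        for (int64_t c = 0; c < cols; ++c)
            rowSum += static_cast<double>(data[r * cols + c]);  // f32 -> f64
        total += rowSum;
    }
    return total;
}
```

Both functions add the same f64 values in the same order, so they return identical results. The fused version only skips the temporary buffer and the extra pass over memory.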


# Performance Comparison

!['debug'](sumall/results/01_debug.png)

!['kb'](sumall/results/02_kb.png)

!['mb'](sumall/results/03_mb.png)

!['gb'](sumall/results/04_gb.png)

!['overview'](sumall/results/05_overview.png)

!['delta'](sumall/results/06_delta.png)

!['type_cast'](sumall_typecast/type_cast.png)

!['type_cast_speedup'](sumall_typecast/type_cast_speedup.png)

# Next Steps

- Interop cont.
- Multi-threaded execution
- MatMulOp
- Compare binary sizes precompiled vs codegen