## How to optimize an MLIR-AIR program using the AIR-Runner performance model

This notebook gives a step-by-step guide to generating performance time traces with MLIR-AIR's performance simulator, AIR-Runner, and shows how AIR-Runner can help a user optimize an MLIR-AIR program. Matrix multiplication is used as the running example.

### Outline
- Compile a matrix multiplication program written in MLIR's LinAlg dialect (one single `linalg.matmul` op) using MLIR-AIR.
- Use AIR-Runner to simulate the performance of the compiled program.
- Based on the simulation time traces, iteratively apply MLIR-AIR compiler optimizations to improve the program's performance on AIEs.

### Compile a matrix multiplication program written in MLIR's LinAlg dialect (one single `linalg.matmul` op) using MLIR-AIR

Use MLIR's Python bindings to create the input program: a single `linalg.matmul` operation.

In [1]:
import air.compiler.util

from air.mlir.dialects import func
from air.mlir.dialects import linalg
from air.mlir.ir import *
import air.mlir.passmanager

import sys

def matmul_on_tensors(m, n, k, dtype):
    module = Module.create()
    with InsertionPoint(module.body):
        @func.FuncOp.from_py_func(
            RankedTensorType.get((m, k), dtype), RankedTensorType.get((k, n), dtype),
            RankedTensorType.get((m, n), dtype))
        def matmul(lhs, rhs, out):
            linalg.matmul(lhs, rhs, outs=[out])
    return module

Compile the program with MLIR-AIR, using the stack of compilation passes described in `pipeline`. To print the compiled MLIR-AIR program, uncomment `print (air_module)` at the end of the script.

In [2]:
with air.mlir.ir.Context(), Location.unknown():

    air_module = matmul_on_tensors(256, 256, 1536, BF16Type.get())
    
    # convert linalg on tensors to linalg on memrefs
    pm = air.mlir.passmanager.PassManager.parse(air.compiler.util.LINALG_TENSOR_TO_MEMREF_PIPELINE)
    pm.run(air_module.operation)

    # tile and map to air
    pipeline = "builtin.module("+",".join([
        # L1 and L2 tiling
        "air-linalg-codegen{l2-tile-size=64,64,128 l2-promote=true l1-tile-size=32,32,32 l1-promote=true}",
        # clean up
        "canonicalize", "cse",
        # bind depth 1 scf.parallel op as air.herd (i.e. AIE kernel)
        "air-par-to-herd{depth=1}",
        # bind data copy ops to AIE's DMA resources
        "air-copy-to-dma",
        # bind depth 0 scf.parallel op as air.launch
        "air-par-to-launch{has-air-segment=true}",
        "canonicalize", "cse",
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # generate dependency information for runner
    pipeline = "builtin.module("+",".join([
        # analyze the data dependency between asynchronous events
        "air-dependency",
        # convert air.dma data movement ops into half-DMA 'air.channel' puts and gets
        "air-dma-to-channel",
        "canonicalize", "cse",
        # clean up dependency graph
        "air-dependency-canonicalize",
        "canonicalize", "cse",
        # place air.herd to physical locations in air.segment (Greedy algorithm)
        "air-place-herds{num-rows=2 num-cols=2 row-anchor=0 col-anchor=0}"
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # print ("\nAIR Dialect Module (async)\n")
    # print (air_module)
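
The pipeline strings above are assembled by joining pass names with commas inside a `builtin.module(...)` anchor. As a minimal sketch of that string-building step only, the helpers below (`pass_spec` and `build_pipeline` are hypothetical names, not part of MLIR-AIR) produce the same textual form without calling into MLIR:

```python
def pass_spec(name, **options):
    """Format one pass, optionally followed by `{key=value ...}` options.

    Hypothetical helper: underscores in keyword names become dashes,
    booleans become true/false, and tuples become comma-separated lists.
    """
    if not options:
        return name
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        parts.append(f"{key.replace('_', '-')}={value}")
    return f"{name}{{{' '.join(parts)}}}"


def build_pipeline(passes, anchor="builtin.module"):
    """Join pass specs with commas and wrap them in the anchor op."""
    return f"{anchor}({','.join(passes)})"


tiling = pass_spec(
    "air-linalg-codegen",
    l2_tile_size=(64, 64, 128), l2_promote=True,
    l1_tile_size=(32, 32, 32), l1_promote=True,
)
pipeline = build_pipeline(
    [tiling, "canonicalize", "cse", pass_spec("air-par-to-herd", depth=1)]
)
print(pipeline)
```

The resulting string could be handed to `PassManager.parse` in the same way as the hand-written pipelines in this notebook; building it from a list makes it easier to insert or reorder passes between experiments.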

### Use AIR-Runner to simulate the performance of the compiled program.

AIR-Runner analyzes the program's schedule, and based on a user-provided resource model of the target device, generates time traces to estimate the performance. The resource model is queried at run-time throughout the simulation, so that both static and dynamic resource allocations are monitored, and resource constraints are constantly enforced.

Below is an example resource model that describes a hypothetical custom hardware device named "testdevice".

In [3]:
arch = {
    "clock": 1000000000,
    "cores": 1,
    "datatypes": [
        {
        "bytes": 1,
        "name": "i8"
        },
        {
        "bytes": 2,
        "name": "bf16"
        },
        {
        "bytes": 4,
        "name": "i32"
        }
    ],
    "devicename": "testdevice",
    "kernels": {
        "linalg.copy": {
            "datatypes": {
                "i8": {
                    "ops_per_core_per_cycle": 32,
                    "efficiency": 1
                },
                "bf16": {
                    "ops_per_core_per_cycle": 32,
                    "efficiency": 1
                },
                "i32": {
                    "ops_per_core_per_cycle": 16,
                    "efficiency": 1
                }
            },
            "name": "linalg.copy"
        },
        "linalg.fill": {
            "datatypes": {
                "i8": {
                    "ops_per_core_per_cycle": 32,
                    "efficiency": 1
                },
                "bf16": {
                    "ops_per_core_per_cycle": 32,
                    "efficiency": 1
                },
                "i32": {
                    "ops_per_core_per_cycle": 16,
                    "efficiency": 1
                }
            },
            "name": "linalg.fill"
        },
        "linalg.generic": {
            "datatypes": {
                "i8": {
                    "ops_per_core_per_cycle": 1,
                    "efficiency": 1
                },
                "bf16": {
                    "ops_per_core_per_cycle": 1,
                    "efficiency": 1
                },
                "i32": {
                    "ops_per_core_per_cycle": 1,
                    "efficiency": 1
                }
            },
            "name": "linalg.generic"
        },
        "linalg.matmul": {
            "datatypes": {
                "i8": {
                    "macs_per_core_per_cycle": 256,
                    "efficiency": 1
                },
                "bf16": {
                    "macs_per_core_per_cycle": 128,
                    "efficiency": 1
                },
                "i32": {
                    "macs_per_core_per_cycle": 32,
                    "efficiency": 1
                }
            },
            "name": "linalg.matmul"
        }
    },
    "dus": {
        "count": [4, 4],
        "memory": {
            "memory_space": "L2",
            "bytes": 524288
        },
        "ports": {
            "outbound": {
                "count": 6,
                "bytes_per_second": 4000000000
            },
            "inbound": {
                "count": 6,
                "bytes_per_second": 4000000000
            }
        },
        "tiles": {
            "count": [1, 4],
            "memory": {
                "memory_space": "L1",
                "bytes": 65536
            },
            "ports": {
                "outbound": {
                    "count": 2,
                    "bytes_per_second": 4000000000
                },
                "inbound": {
                    "count": 2,
                    "bytes_per_second": 4000000000
                }
            }
        }
    },
    "noc": {
        "outbound": {
            "count": 4,
            "bytes_per_second": 4000000000
        },
        "inbound": {
            "count": 4,
            "bytes_per_second": 4000000000
        }
    }
  }
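
To get a feel for the numbers in this resource model, the sketch below derives two back-of-envelope quantities for one 32x32x32 bf16 L1 tile: the cycles needed to compute it and the cycles needed to load one of its input tiles over a single port. It assumes one port per transfer and no setup overhead, so it is only a rough estimate; AIR-Runner's own model may account for these costs differently. The function names are hypothetical.

```python
# Values copied from the resource model above.
CLOCK_HZ = 1_000_000_000
BF16_BYTES = 2
BF16_MACS_PER_CORE_PER_CYCLE = 128
L1_PORT_BYTES_PER_SECOND = 4_000_000_000


def matmul_compute_cycles(m, n, k, macs_per_cycle):
    """Cycles for an m x n x k matmul at a given MAC rate (efficiency = 1)."""
    return m * n * k / macs_per_cycle


def transfer_cycles(num_elements, bytes_per_element, port_bytes_per_second, clock_hz):
    """Cycles to move a buffer over one port, ignoring any setup overhead."""
    bytes_per_cycle = port_bytes_per_second / clock_hz
    return num_elements * bytes_per_element / bytes_per_cycle


compute = matmul_compute_cycles(32, 32, 32, BF16_MACS_PER_CORE_PER_CYCLE)
load_a = transfer_cycles(32 * 32, BF16_BYTES, L1_PORT_BYTES_PER_SECOND, CLOCK_HZ)

print(f"compute one 32x32x32 tile: {compute:.0f} cycles")
print(f"load one 32x32 bf16 input: {load_a:.0f} cycles")
```

Under these assumptions, loading a single input tile takes longer than computing on it, which suggests that data movement rather than compute is likely to dominate with small tiles.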


The following utility functions print performance statistics from a trace file: peak L1 and L2 memory utilization, and herd compute efficiency.

In [4]:
import json
import re
import math
import numpy

def printMemoryFootprint(arch, trace_file_name):

    # look up byte size per data type
    datatype = arch["datatypes"]

    # trace model
    with open(trace_file_name) as f:
        trace_model = json.load(f)

    # init dict
    trace_dict = {}
    for d in trace_model:
        if d and d["name"] == "process_name":
            trace_dict[d["pid"]] = [d["args"]["name"], 0, 0]
            name_split = re.split(r'\[|\]|, ', d["args"]["name"])
            # parse the shape (e.g. [2, 2]) with int() rather than eval()
            trace_dict[d["pid"]].append([int(i) for i in name_split[1:-1]])

    # go through traces
    for d in trace_model:
        if d and "AllocOp" in d["name"] and d["ph"] == "B":
            name_split = re.split(r'\(|\)|, ', d["name"])
            vol = int(name_split[2])
            ty = name_split[3]
            datatype_size = [item["bytes"] for item in datatype if item["name"] == ty][0]
            size = vol * datatype_size
            trace_dict[d["pid"]][1] = trace_dict[d["pid"]][1] + size
            trace_dict[d["pid"]][2] = max(trace_dict[d["pid"]][1], trace_dict[d["pid"]][2])
        if d and "DeallocOp" in d["name"] and d["ph"] == "B":
            name_split = re.split(r'\(|\)|, ', d["name"])
            vol = int(name_split[2])
            ty = name_split[3]
            datatype_size = [item["bytes"] for item in datatype if item["name"] == ty][0]
            size = vol * datatype_size
            trace_dict[d["pid"]][1] = trace_dict[d["pid"]][1] - size

    # get device capacity from device model
    L1_memory_size = arch["dus"]["tiles"]["memory"]["bytes"]
    L2_memory_size = arch["dus"]["memory"]["bytes"]
    du_shape = arch["dus"]["tiles"]["count"]

    # performance stats
    for i in sorted(trace_dict):  # iterate over the pids actually present in the trace
        if "air.herd" in trace_dict[i][0]:
            proc_name = trace_dict[i][0]
            mem_usage = trace_dict[i][2]
            print(proc_name + " " + str(i) + " L1 memory peak util.: %" + str(float(mem_usage) / float(L1_memory_size) * 100))
        if "air.segment" in trace_dict[i][0]:
            proc_name = trace_dict[i][0]
            mem_usage = trace_dict[i][2]
            du_usage = numpy.prod([math.ceil(j / k) for j, k in zip(trace_dict[i][3], du_shape)])
            print(proc_name + " " + str(i) + " L2 memory peak util.: %" + str(float(mem_usage) / float(L2_memory_size * du_usage) * 100))
        

def printHerdComputeEfficiency(arch, trace_file_name):

    # trace model
    with open(trace_file_name) as f:
        trace_model = json.load(f)

    # init dict
    trace_dict = {}
    for d in trace_model:
        if d and d["name"] == "process_name" and "air.herd" in d["args"]["name"]:
            trace_dict[d["pid"]] = [d["args"]["name"], 0, 0]

    # go through traces
    for d in trace_model:
        if d and "LinalgOp" in d["name"] and d["ph"] == "B":
            trace_dict[d["pid"]][1] = float(d["ts"])
        if d and "LinalgOp" in d["name"] and d["ph"] == "E":
            compute_latency = float(d["ts"]) - trace_dict[d["pid"]][1]
            trace_dict[d["pid"]][2] = trace_dict[d["pid"]][2] + compute_latency
            trace_dict[d["pid"]][1] = 0
        if d and "LaunchTerminator" in d["name"] and d["ph"] == "E":
            end_time = float(d["ts"])
    
    # performance stats
    for key, value in trace_dict.items():
        if "air.herd" in value[0]:
            proc_name = value[0]
            print(proc_name + " " + str(key) + " herd compute efficiency: %" + str(float(value[2]) / float(end_time) * 100))
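
`printMemoryFootprint` recovers the herd or segment shape from the process name and uses it to work out how many DUs a segment spans. A standalone sketch of just that step is shown below; `parse_shape` and `dus_spanned` are hypothetical names, and the regular expression is a slightly stricter alternative to the `re.split` call used above.

```python
import math
import re


def parse_shape(process_name):
    """Return the integer shape embedded in a name such as 'air.herd[2, 2]'."""
    match = re.search(r"\[([^\]]*)\]", process_name)
    if match is None or not match.group(1).strip():
        return []
    return [int(tok) for tok in match.group(1).split(",")]


def dus_spanned(shape, tiles_per_du):
    """Number of DUs needed to hold `shape` tiles, rounding up per dimension."""
    return math.prod(math.ceil(s / t) for s, t in zip(shape, tiles_per_du))


segment_shape = parse_shape("air.segment[2, 2]")
# [1, 4] is the "tiles" count per DU in the resource model above.
print(segment_shape, dus_spanned(segment_shape, [1, 4]))
```

For a `[2, 2]` segment on DUs of `[1, 4]` tiles this gives two DUs, which is why the L2 utilization is reported against twice the per-DU L2 capacity.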



AIR-Runner expects two inputs: an MLIR-AIR program file and a JSON resource model which describes the target device. It generates a JSON time trace file formatted for Chrome tracing or Perfetto trace viewers.

In [5]:
# arch: resource model. "trace.out": output file name. "herd": simulation granularity (per herd or per core).
runner = air.compiler.util.Runner(arch, "trace.out", "herd")
# air_module: compiled MLIR-AIR program. "matmul": function name for simulation
trace = runner.run(air_module, "matmul")

# performance evaluation
printMemoryFootprint(arch, "trace.out")
printHerdComputeEfficiency(arch, "trace.out")

Latency: 1895.697us
air.segment[2, 2] 2 L2 memory peak util.: %3.90625
air.herd[2, 2] 3 L1 memory peak util.: %9.375
air.herd[2, 2] 3 herd compute efficiency: %10.371283159325136
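
The reported herd compute efficiency can be cross-checked by hand. The sketch below assumes that the four cores of the 2x2 herd each sustain the bf16 `macs_per_core_per_cycle` rate from the resource model, computes the ideal compute time for the whole matmul, and divides it by the reported latency. This is a sanity check under that assumption, not a description of AIR-Runner's internals.

```python
# Problem size and resource model values used in this notebook.
M, N, K = 256, 256, 1536
BF16_MACS_PER_CORE_PER_CYCLE = 128
CLOCK_HZ = 1_000_000_000
HERD_CORES = 2 * 2  # assumption: one core per tile of the [2, 2] herd

total_macs = M * N * K
ideal_cycles = total_macs / (HERD_CORES * BF16_MACS_PER_CORE_PER_CYCLE)
ideal_us = ideal_cycles / CLOCK_HZ * 1e6

latency_us = 1895.697  # latency reported by the run above
efficiency_pct = ideal_us / latency_us * 100

print(f"ideal compute time: {ideal_us:.3f} us")
print(f"compute efficiency: {efficiency_pct:.2f} %")
```

The result agrees with the reported efficiency to within rounding, which suggests that the efficiency figure is simply the fraction of the total latency during which the herd is computing.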


After the run finishes, a trace file named "trace.out" is generated. This trace file can be visualized using Perfetto: https://ui.perfetto.dev/

### Based on the simulation time traces, iteratively apply MLIR-AIR compiler optimizations to improve the program's performance on AIEs.

- Optimization 1: Data broadcasting and hoisting data movement out of for loops.

First, we can leverage the AIE device's data broadcasting capability by broadcasting the same input data to multiple AIE tiles in parallel. Second, the partial results do not need to be copied in and out of tile memory on every iteration of the for loop; they can persist in tile memory.

The `-air-dependency-schedule-opt` compilation pass can automatically detect such opportunities and perform the optimizations. The `-air-specialize-dma-broadcast` pass specializes the code to explicitly represent the broadcast patterns, so that downstream compiler passes can map them to hardware.

In [6]:
with air.mlir.ir.Context(), Location.unknown():

    air_module = matmul_on_tensors(256, 256, 1536, BF16Type.get())
    
    # convert linalg on tensors to linalg on memrefs
    pm = air.mlir.passmanager.PassManager.parse(air.compiler.util.LINALG_TENSOR_TO_MEMREF_PIPELINE)
    pm.run(air_module.operation)

    # tile and map to air
    pipeline = "builtin.module("+",".join([
        "air-linalg-codegen{l2-tile-size=64,64,128 l2-promote=true l1-tile-size=32,32,32 l1-promote=true}",
        "canonicalize", "cse",
        "air-par-to-herd{depth=1}",
        "air-copy-to-dma",
        "air-par-to-launch{has-air-segment=true}",
        "canonicalize", "cse",
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # generate dependency information for runner
    pipeline = "builtin.module("+",".join([
        "air-dependency",
        "air-dependency-schedule-opt", # <--------- new pass
        "air-specialize-dma-broadcast", # <-------- new pass
        "air-dma-to-channel",
        "canonicalize", "cse",
        "air-dependency-canonicalize",
        "canonicalize", "cse",
        "air-place-herds{num-rows=2 num-cols=2 row-anchor=0 col-anchor=0}"
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

runner = air.compiler.util.Runner(arch, "trace.out", "herd")
trace = runner.run(air_module, "matmul")

printMemoryFootprint(arch, "trace.out")
printHerdComputeEfficiency(arch, "trace.out")

Latency: 1641.377us
air.segment[2, 2] 2 L2 memory peak util.: %3.90625
air.herd[2, 2] 3 L1 memory peak util.: %9.375
air.herd[2, 2] 3 herd compute efficiency: %11.978242645195165
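
To illustrate why these two optimizations reduce data movement, the sketch below counts source-side transfers into and out of L1 for a herd computing a tiled matmul. It is a simplified counting model with a hypothetical function name, not what the passes literally emit: it assumes tiles in the same herd row share an A tile, tiles in the same herd column share a B tile, and each output tile makes one round trip (copy in, copy out) per reduction step unless it is hoisted.

```python
def l1_transfers(herd_rows, herd_cols, k_steps, broadcast, hoist_output):
    """Count source-side L1 transfers for a tiled matmul on a herd."""
    tiles = herd_rows * herd_cols
    if broadcast:
        a_streams = herd_rows  # one A tile per herd row, broadcast along the row
        b_streams = herd_cols  # one B tile per herd column, broadcast along the column
    else:
        a_streams = b_streams = tiles  # every tile gets its own copy
    input_transfers = (a_streams + b_streams) * k_steps
    c_round_trips = 1 if hoist_output else k_steps
    output_transfers = 2 * tiles * c_round_trips  # copy in + copy out
    return input_transfers + output_transfers


# 4 reduction steps: an L2 K tile of 128 split into L1 K tiles of 32.
before = l1_transfers(2, 2, k_steps=4, broadcast=False, hoist_output=False)
after = l1_transfers(2, 2, k_steps=4, broadcast=True, hoist_output=True)
print(before, after)
```

Fewer transfers do not translate one-to-one into lower latency, since transfers can run in parallel on different ports, but the count gives a sense of how much redundant traffic is removed.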


- Optimization 2: Ping-pong buffering (both L1 and L2).

The `air-label-scf-for-to-ping-pong` and `air-ping-pong-transform` passes automatically double the buffer allocation so that compute can overlap with data movement.

In [7]:
with air.mlir.ir.Context(), Location.unknown():

    air_module = matmul_on_tensors(256, 256, 1536, BF16Type.get())
    
    # convert linalg on tensors to linalg on memrefs
    pm = air.mlir.passmanager.PassManager.parse(air.compiler.util.LINALG_TENSOR_TO_MEMREF_PIPELINE)
    pm.run(air_module.operation)

    # tile and map to air
    pipeline = "builtin.module("+",".join([
        "air-linalg-codegen{l2-tile-size=64,64,128 l2-promote=true l1-tile-size=32,32,32 l1-promote=true}",
        "canonicalize", "cse",
        "air-par-to-herd{depth=1}",
        "air-copy-to-dma",
        "air-par-to-launch{has-air-segment=true}",
        "canonicalize", "cse",
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # generate dependency information for runner
    pipeline = "builtin.module("+",".join([
        "air-dependency",
        "air-dependency-schedule-opt",
        "air-specialize-dma-broadcast",
        "air-dma-to-channel",
        "canonicalize", "cse",
        "air-dependency-canonicalize",
        "canonicalize", "cse",
        "air-place-herds{num-rows=2 num-cols=2 row-anchor=0 col-anchor=0}",
        "air-label-scf-for-to-ping-pong", # <------ new pass
        "air-ping-pong-transform" # <-------------- new pass
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

runner = air.compiler.util.Runner(arch, "trace.out", "herd")
trace = runner.run(air_module, "matmul")

printMemoryFootprint(arch, "trace.out")
printHerdComputeEfficiency(arch, "trace.out")

Latency: 897.521us
air.segment[2, 2] 2 L2 memory peak util.: %7.03125
air.herd[2, 2] 3 L1 memory peak util.: %15.625
air.herd[2, 2] 3 herd compute efficiency: %21.905695694803498
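
The benefit of ping-pong buffering can be illustrated with an idealized two-stage pipeline: without it, each tile is loaded and then computed in sequence; with it, the load of the next tile overlaps the compute of the current one. The sketch below uses hypothetical function names and illustrative cycle counts, and ignores write-back and synchronization costs.

```python
def serial_time(n_tiles, load, compute):
    """Single buffer: every tile is loaded, then computed, back to back."""
    return n_tiles * (load + compute)


def ping_pong_time(n_tiles, load, compute):
    """Two buffers: load of tile i+1 overlaps compute of tile i."""
    if n_tiles == 0:
        return 0
    # First load cannot be hidden, last compute cannot be hidden,
    # and each step in between is bounded by the slower of the two stages.
    return load + (n_tiles - 1) * max(load, compute) + compute


# Illustrative numbers only: 4 tiles, 512 cycles to load, 256 cycles to compute.
print(serial_time(4, 512, 256), ping_pong_time(4, 512, 256))
```

In this model the steady-state cost per tile drops from `load + compute` to `max(load, compute)`, at the price of the larger memory footprint visible in the utilization figures above.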


- Optimization 3: Use up all available memory resources.

Increase the L1 and L2 tile sizes so that the program uses as much of the available L1 and L2 memory as possible, and fine-tune the tiling factors to balance compute against communication.

In [8]:
with air.mlir.ir.Context(), Location.unknown():

    air_module = matmul_on_tensors(256, 256, 1536, BF16Type.get())
    
    # convert linalg on tensors to linalg on memrefs
    pm = air.mlir.passmanager.PassManager.parse(air.compiler.util.LINALG_TENSOR_TO_MEMREF_PIPELINE)
    pm.run(air_module.operation)

    # tile and map to air
    pipeline = "builtin.module("+",".join([
        "air-linalg-codegen{l2-tile-size=128,128,384 l2-promote=true l1-tile-size=64,64,96 l1-promote=true}",
        "canonicalize", "cse",
        "air-par-to-herd{depth=1}",
        "air-copy-to-dma",
        "air-par-to-launch{has-air-segment=true}",
        "canonicalize", "cse",
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # generate dependency information for runner
    pipeline = "builtin.module("+",".join([
        "air-dependency",
        "air-dependency-schedule-opt",
        "air-specialize-dma-broadcast",
        "air-dma-to-channel",
        "canonicalize", "cse",
        "air-dependency-canonicalize",
        "canonicalize", "cse",
        "air-place-herds{num-rows=2 num-cols=2 row-anchor=0 col-anchor=0}",
        "air-label-scf-for-to-ping-pong",
        "air-ping-pong-transform"
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

runner = air.compiler.util.Runner(arch, "trace.out", "herd")
trace = runner.run(air_module, "matmul")

printMemoryFootprint(arch, "trace.out")
printHerdComputeEfficiency(arch, "trace.out")

Latency: 528.485us
air.segment[2, 2] 2 L2 memory peak util.: %40.625
air.herd[2, 2] 3 L1 memory peak util.: %87.5
air.herd[2, 2] 3 herd compute efficiency: %37.20226156326396
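
The memory utilization figures reported in this notebook can be reproduced from the tile sizes alone. The sketch below is a worked check with hypothetical function names; it assumes that ping-pong buffering doubles the two input buffers (A and B) but keeps a single output buffer (C), an assumption inferred from the reported numbers rather than stated by the passes. The L2 capacity is taken over the two DUs that the `[2, 2]` segment spans.

```python
import math

BF16_BYTES = 2
L1_BYTES = 65536          # per tile, from the resource model
L2_BYTES = 524288 * 2     # per DU, times the 2 DUs spanned by the segment


def matmul_buffer_bytes(tile_m, tile_n, tile_k, ping_pong):
    """Bytes for the A, B and C tiles of one matmul tile."""
    input_copies = 2 if ping_pong else 1  # assumption: only A and B are doubled
    a = tile_m * tile_k
    b = tile_k * tile_n
    c = tile_m * tile_n
    return (input_copies * (a + b) + c) * BF16_BYTES


def utilization_pct(used_bytes, capacity_bytes):
    return used_bytes / capacity_bytes * 100


l1_base = utilization_pct(matmul_buffer_bytes(32, 32, 32, False), L1_BYTES)
l1_pp = utilization_pct(matmul_buffer_bytes(32, 32, 32, True), L1_BYTES)
l1_big = utilization_pct(matmul_buffer_bytes(64, 64, 96, True), L1_BYTES)
l2_base = utilization_pct(matmul_buffer_bytes(64, 64, 128, False), L2_BYTES)
l2_pp = utilization_pct(matmul_buffer_bytes(64, 64, 128, True), L2_BYTES)
l2_big = utilization_pct(matmul_buffer_bytes(128, 128, 384, True), L2_BYTES)

print("L1 %:", l1_base, l1_pp, l1_big)
print("L2 %:", l2_base, l2_pp, l2_big)
```

A calculation like this can be used to screen candidate tile sizes against the L1 and L2 capacities before running the compiler and the simulator.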


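- Optimization 4: Double the memory tile port usage.

The target device has unused ports for data movement between L1 and L2. Unrolling `channel_5` and `channel_6` by a factor of two with `air-unroll-channel-by-factor` allocates more ports and increases the bandwidth.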

In [9]:
with air.mlir.ir.Context(), Location.unknown():

    air_module = matmul_on_tensors(256, 256, 1536, BF16Type.get())
    
    # convert linalg on tensors to linalg on memrefs
    pm = air.mlir.passmanager.PassManager.parse(air.compiler.util.LINALG_TENSOR_TO_MEMREF_PIPELINE)
    pm.run(air_module.operation)

    # tile and map to air
    pipeline = "builtin.module("+",".join([
        "air-linalg-codegen{l2-tile-size=128,128,384 l2-promote=true l1-tile-size=64,64,96 l1-promote=true}",
        "canonicalize", "cse",
        "air-par-to-herd{depth=1}",
        "air-copy-to-dma",
        "air-par-to-launch{has-air-segment=true}",
        "canonicalize", "cse",
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

    # generate dependency information for runner
    pipeline = "builtin.module("+",".join([
        "air-dependency",
        "air-dependency-schedule-opt",
        "air-specialize-dma-broadcast",
        "air-dma-to-channel",
        "canonicalize", "cse",
        "air-dependency-canonicalize",
        "canonicalize", "cse",
        "air-place-herds{num-rows=2 num-cols=2 row-anchor=0 col-anchor=0}",
        "air-unroll-channel-by-factor{channel-name=channel_5 unroll-dim=0 unroll-factor=2}",
        "air-unroll-channel-by-factor{channel-name=channel_6 unroll-dim=0 unroll-factor=2}",
        "air-label-scf-for-to-ping-pong",
        "air-ping-pong-transform"
    ])+')'
    pm = air.mlir.passmanager.PassManager.parse(pipeline)
    pm.run(air_module.operation)

runner = air.compiler.util.Runner(arch, "trace.out", "herd")
trace = runner.run(air_module, "matmul")

printMemoryFootprint(arch, "trace.out")
printHerdComputeEfficiency(arch, "trace.out")

Latency: 418.125us
air.segment[2, 2] 2 L2 memory peak util.: %40.625
air.herd[2, 2] 3 L1 memory peak util.: %87.5
air.herd[2, 2] 3 herd compute efficiency: %47.02145774937579
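
To summarize the effect of each step, the sketch below tabulates the latencies reported by the runs in this notebook and computes the speedup of each optimization over the previous step and over the baseline. The labels are shorthand for the optimizations above.

```python
# Latencies (in microseconds) as reported by the runs in this notebook.
latencies_us = [
    ("baseline", 1895.697),
    ("1: broadcast + hoisting", 1641.377),
    ("2: ping-pong buffering", 897.521),
    ("3: larger tiles", 528.485),
    ("4: more memory tile ports", 418.125),
]

baseline = latencies_us[0][1]
previous = baseline
rows = []
for label, latency in latencies_us:
    # (label, latency, speedup over previous step, speedup over baseline)
    rows.append((label, latency, previous / latency, baseline / latency))
    previous = latency

for label, latency, step, total in rows:
    print(f"{label:<28} {latency:>10.3f} us  step x{step:.2f}  total x{total:.2f}")
```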


Copyright (C) 2023, Advanced Micro Devices, Inc. All rights reserved.

SPDX-License-Identifier: MIT