## OpenLane Colab

This Google Colab notebook will:
* Install OpenLane and its dependencies
* Run a simple design, namely a 4×4 systolic array with an FSM controller,
  through the flow, targeting the [open source sky130 PDK](https://github.com/google/skywater-pdk/)
  by Google and SkyWater.

In [3]:
# @title Setup Nix {display-mode: "form"}
# @markdown <img src="https://raw.githubusercontent.com/NixOS/nixos-artwork/master/logo/nix-snowflake.svg" width="32"/>
# @markdown
# @markdown Nix is a package manager with an emphasis on reproducible builds,
# @markdown and it is the primary method for installing OpenLane 2.
# @markdown
# @markdown This step installs the Nix package manager and enables the
# @markdown experimental "flakes" feature.
# @markdown
# @markdown If you're not in a Colab, this just sets the environment variables.
# @markdown You will need to install Nix and enable flakes on your own following
# @markdown [this guide](https://openlane2.readthedocs.io/en/stable/getting_started/common/nix_installation/index.html).
import os
import sys
import shutil

os.environ["LOCALE_ARCHIVE"] = "/usr/lib/locale/locale-archive"

if "google.colab" in sys.modules:
    if shutil.which("nix-env") is None:
        !curl -L https://nixos.org/nix/install | bash -s -- --daemon --yes
        !echo "extra-experimental-features = nix-command flakes" >> /etc/nix/nix.conf
        !killall nix-daemon
else:
    if shutil.which("nix-env") is None:
        raise RuntimeError("Nix is not installed!")

os.environ["PATH"] = f"/nix/var/nix/profiles/default/bin/:{os.getenv('PATH')}"

In [4]:
# @title Get OpenLane {display-mode: "form"}
# @markdown Click the ▷ button to download and install OpenLane.
# @markdown
# @markdown This will install OpenLane's tool dependencies using Nix,
# @markdown and OpenLane itself using PIP.
# @markdown
# @markdown Note that `python3-tk` may need to be installed using your OS's
# @markdown package manager.
import os
import subprocess
import sys
import IPython

openlane_version = "version-2.1"  # @param {key:"OpenLane Version", type:"string"}

if openlane_version == "latest":
    openlane_version = "main"

pdk_root = "~/.volare"  # @param {key:"PDK Root", type:"string"}

pdk_root = os.path.expanduser(pdk_root)

pdk = "sky130"  # @param {key:"PDK (without the variant)", type:"string"}

openlane_ipynb_path = os.path.join(os.getcwd(), "openlane_ipynb")

display(IPython.display.HTML("<h3>Downloading OpenLane…</h3>"))


TESTING_LOCALLY = False
!rm -rf {openlane_ipynb_path}
!mkdir -p {openlane_ipynb_path}
if TESTING_LOCALLY:
    !ln -s {os.getcwd()} {openlane_ipynb_path}
else:
    !curl -L "https://github.com/efabless/openlane2/tarball/{openlane_version}" | tar -xzC {openlane_ipynb_path} --strip-components 1

try:
    import tkinter
except ImportError:
    if "google.colab" in sys.modules:
        !sudo apt-get install -y python3-tk

try:
    import tkinter
except ImportError as e:
    display(
        IPython.display.HTML(
            '<h3 style="color: #800020;">❌ Failed to import the <code>tkinter</code> library for Python, which is required to load PDK configuration values. Make sure <code>python3-tk</code> or equivalent is installed on your system.</h3>'
        )
    )
    raise e from None


display(IPython.display.HTML("<h3>Downloading OpenLane's dependencies…</h3>"))
try:
    subprocess.check_call(
        ["nix", "profile", "install", ".#colab-env", "--accept-flake-config"],
        cwd=openlane_ipynb_path,
    )
except subprocess.CalledProcessError as e:
    display(
        IPython.display.HTML(
            '<h3 style="color: #800020;">❌ Failed to install binary dependencies using Nix…</h3>'
        )
    )
    # Stop here: the later steps depend on the tools installed by Nix.
    raise e from None

display(IPython.display.HTML("<h3>Downloading Python dependencies using PIP…</h3>"))
try:
    subprocess.check_call(
        ["pip3", "install", "."],
        cwd=openlane_ipynb_path,
    )
except subprocess.CalledProcessError as e:
    display(
        IPython.display.HTML(
            '<h3 style="color: #800020;">❌ Failed to install Python dependencies using PIP…</h3>'
        )
    )
    raise e from None

display(IPython.display.HTML("<h3>Downloading PDK…</h3>"))
import volare

volare.enable(
    volare.get_volare_home(pdk_root),
    pdk,
    open(
        os.path.join(openlane_ipynb_path, "openlane", "open_pdks_rev"),
        encoding="utf8",
    )
    .read()
    .strip(),
)

sys.path.insert(0, openlane_ipynb_path)
display(IPython.display.HTML("<h3>⭕️ Done.</h3>"))

import logging

# Remove the stupid default colab logging handler
logging.getLogger().handlers.clear()




In [5]:
import openlane

print(openlane.__version__)

2.1.11


### Creating the design

Now that OpenLane is set up, we can write a Verilog file as follows:

In [None]:
%%writefile spm.v
// ===========================================
// 4×4 Systolic Array with FSM Controller
// Pure Verilog Design
// ===========================================

`timescale 1ns/1ps

// ===========================================
// FSM-Controlled 4×4 Systolic Array Top Module
// ===========================================
module systolic_array_4x4 (
    input         clk,
    input         rst_n,
    input         start,              // Start computation

    // Matrix A inputs (row-wise)
    input  [15:0] matrix_a_00, matrix_a_01, matrix_a_02, matrix_a_03,
    input  [15:0] matrix_a_10, matrix_a_11, matrix_a_12, matrix_a_13,
    input  [15:0] matrix_a_20, matrix_a_21, matrix_a_22, matrix_a_23,
    input  [15:0] matrix_a_30, matrix_a_31, matrix_a_32, matrix_a_33,

    // Matrix B inputs (column-wise)
    input  [7:0]  matrix_b_00, matrix_b_01, matrix_b_02, matrix_b_03,
    input  [7:0]  matrix_b_10, matrix_b_11, matrix_b_12, matrix_b_13,
    input  [7:0]  matrix_b_20, matrix_b_21, matrix_b_22, matrix_b_23,
    input  [7:0]  matrix_b_30, matrix_b_31, matrix_b_32, matrix_b_33,

    // Results output (4×4 matrix)
    output [31:0] result_00, result_01, result_02, result_03,
    output [31:0] result_10, result_11, result_12, result_13,
    output [31:0] result_20, result_21, result_22, result_23,
    output [31:0] result_30, result_31, result_32, result_33,

    output        computation_done,
    output        result_valid
);

    // FSM States
    parameter IDLE       = 3'b000;
    parameter LOAD_DATA  = 3'b001;
    parameter COMPUTE    = 3'b010;
    parameter DRAIN      = 3'b011;
    parameter DONE       = 3'b100;

    // FSM signals
    reg [2:0] current_state, next_state;
    reg [3:0] cycle_counter;
    reg [3:0] compute_counter;

    // Control signals
    reg enable_pe;
    reg clear_accum_pe;
    reg data_feed_enable;
    reg weight_feed_enable;

    // Data input scheduling registers
    reg [15:0] data_schedule [0:6][0:3];  // 7 cycles, 4 data inputs
    reg [7:0]  weight_schedule [0:6][0:3]; // 7 cycles, 4 weight inputs
    reg        data_valid_schedule [0:6][0:3];
    reg        weight_valid_schedule [0:6][0:3];

    // Current cycle inputs to PE array
    reg [15:0] data_in_0, data_in_1, data_in_2, data_in_3;
    reg [7:0]  weight_in_0, weight_in_1, weight_in_2, weight_in_3;
    reg        data_valid_0, data_valid_1, data_valid_2, data_valid_3;
    reg        weight_valid_0, weight_valid_1, weight_valid_2, weight_valid_3;

    // Internal PE interconnects
    wire [15:0] data_h_0_1, data_h_0_2, data_h_0_3, data_h_0_4;
    wire [15:0] data_h_1_1, data_h_1_2, data_h_1_3, data_h_1_4;
    wire [15:0] data_h_2_1, data_h_2_2, data_h_2_3, data_h_2_4;
    wire [15:0] data_h_3_1, data_h_3_2, data_h_3_3, data_h_3_4;

    wire data_valid_h_0_1, data_valid_h_0_2, data_valid_h_0_3, data_valid_h_0_4;
    wire data_valid_h_1_1, data_valid_h_1_2, data_valid_h_1_3, data_valid_h_1_4;
    wire data_valid_h_2_1, data_valid_h_2_2, data_valid_h_2_3, data_valid_h_2_4;
    wire data_valid_h_3_1, data_valid_h_3_2, data_valid_h_3_3, data_valid_h_3_4;

    wire [7:0] weight_v_1_0, weight_v_2_0, weight_v_3_0, weight_v_4_0;
    wire [7:0] weight_v_1_1, weight_v_2_1, weight_v_3_1, weight_v_4_1;
    wire [7:0] weight_v_1_2, weight_v_2_2, weight_v_3_2, weight_v_4_2;
    wire [7:0] weight_v_1_3, weight_v_2_3, weight_v_3_3, weight_v_4_3;

    wire weight_valid_v_1_0, weight_valid_v_2_0, weight_valid_v_3_0, weight_valid_v_4_0;
    wire weight_valid_v_1_1, weight_valid_v_2_1, weight_valid_v_3_1, weight_valid_v_4_1;
    wire weight_valid_v_1_2, weight_valid_v_2_2, weight_valid_v_3_2, weight_valid_v_4_2;
    wire weight_valid_v_1_3, weight_valid_v_2_3, weight_valid_v_3_3, weight_valid_v_4_3;

    wire valid_00, valid_01, valid_02, valid_03;
    wire valid_10, valid_11, valid_12, valid_13;
    wire valid_20, valid_21, valid_22, valid_23;
    wire valid_30, valid_31, valid_32, valid_33;

    // ===========================================
    // FSM State Machine
    // ===========================================

    // State register
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            current_state <= IDLE;
        else
            current_state <= next_state;
    end

    // Next state logic
    always @(*) begin
        case (current_state)
            IDLE: begin
                if (start)
                    next_state = LOAD_DATA;
                else
                    next_state = IDLE;
            end

            LOAD_DATA: begin
                if (cycle_counter == 4'd6)  // 7 cycles (0-6) for loading
                    next_state = COMPUTE;
                else
                    next_state = LOAD_DATA;
            end

            COMPUTE: begin
                if (compute_counter == 4'd7)  // Additional compute cycles
                    next_state = DRAIN;
                else
                    next_state = COMPUTE;
            end

            DRAIN: begin
                if (cycle_counter == 4'd3)  // Drain pipeline
                    next_state = DONE;
                else
                    next_state = DRAIN;
            end

            DONE: begin
                next_state = IDLE;
            end

            default: next_state = IDLE;
        endcase
    end

    // Counter management
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            cycle_counter <= 4'd0;
            compute_counter <= 4'd0;
        end else begin
            case (current_state)
                IDLE: begin
                    cycle_counter <= 4'd0;
                    compute_counter <= 4'd0;
                end

                LOAD_DATA: begin
                    if (cycle_counter < 4'd6)
                        cycle_counter <= cycle_counter + 1;
                end

                COMPUTE: begin
                    cycle_counter <= 4'd0;  // Reset for drain phase
                    if (compute_counter < 4'd7)
                        compute_counter <= compute_counter + 1;
                end

                DRAIN: begin
                    if (cycle_counter < 4'd3)
                        cycle_counter <= cycle_counter + 1;
                end

                DONE: begin
                    cycle_counter <= 4'd0;
                    compute_counter <= 4'd0;
                end
            endcase
        end
    end

    // Control signal generation
    always @(*) begin
        enable_pe = 1'b0;
        clear_accum_pe = 1'b0;
        data_feed_enable = 1'b0;
        weight_feed_enable = 1'b0;

        case (current_state)
            IDLE: begin
                clear_accum_pe = 1'b1;
            end

            LOAD_DATA: begin
                enable_pe = 1'b1;
                data_feed_enable = 1'b1;
                weight_feed_enable = 1'b1;
                if (cycle_counter == 4'd0)
                    clear_accum_pe = 1'b1;
            end

            COMPUTE: begin
                enable_pe = 1'b1;
            end

            DRAIN: begin
                enable_pe = 1'b1;
            end

            DONE: begin
                // Results are ready
            end
        endcase
    end

    // ===========================================
    // Data Scheduling Logic
    // ===========================================

    // Initialize data schedule for systolic loading pattern
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            // Clear all schedules
            data_schedule[0][0] <= 16'd0; data_schedule[0][1] <= 16'd0; data_schedule[0][2] <= 16'd0; data_schedule[0][3] <= 16'd0;
            data_schedule[1][0] <= 16'd0; data_schedule[1][1] <= 16'd0; data_schedule[1][2] <= 16'd0; data_schedule[1][3] <= 16'd0;
            data_schedule[2][0] <= 16'd0; data_schedule[2][1] <= 16'd0; data_schedule[2][2] <= 16'd0; data_schedule[2][3] <= 16'd0;
            data_schedule[3][0] <= 16'd0; data_schedule[3][1] <= 16'd0; data_schedule[3][2] <= 16'd0; data_schedule[3][3] <= 16'd0;
            data_schedule[4][0] <= 16'd0; data_schedule[4][1] <= 16'd0; data_schedule[4][2] <= 16'd0; data_schedule[4][3] <= 16'd0;
            data_schedule[5][0] <= 16'd0; data_schedule[5][1] <= 16'd0; data_schedule[5][2] <= 16'd0; data_schedule[5][3] <= 16'd0;
            data_schedule[6][0] <= 16'd0; data_schedule[6][1] <= 16'd0; data_schedule[6][2] <= 16'd0; data_schedule[6][3] <= 16'd0;
        end else if (start) begin
            // Systolic loading pattern for Matrix A
            // Cycle 0
            data_schedule[0][0] <= matrix_a_00; data_schedule[0][1] <= 16'd0;      data_schedule[0][2] <= 16'd0;      data_schedule[0][3] <= 16'd0;
            // Cycle 1
            data_schedule[1][0] <= matrix_a_01; data_schedule[1][1] <= matrix_a_10; data_schedule[1][2] <= 16'd0;      data_schedule[1][3] <= 16'd0;
            // Cycle 2
            data_schedule[2][0] <= matrix_a_02; data_schedule[2][1] <= matrix_a_11; data_schedule[2][2] <= matrix_a_20; data_schedule[2][3] <= 16'd0;
            // Cycle 3
            data_schedule[3][0] <= matrix_a_03; data_schedule[3][1] <= matrix_a_12; data_schedule[3][2] <= matrix_a_21; data_schedule[3][3] <= matrix_a_30;
            // Cycle 4
            data_schedule[4][0] <= 16'd0;      data_schedule[4][1] <= matrix_a_13; data_schedule[4][2] <= matrix_a_22; data_schedule[4][3] <= matrix_a_31;
            // Cycle 5
            data_schedule[5][0] <= 16'd0;      data_schedule[5][1] <= 16'd0;      data_schedule[5][2] <= matrix_a_23; data_schedule[5][3] <= matrix_a_32;
            // Cycle 6
            data_schedule[6][0] <= 16'd0;      data_schedule[6][1] <= 16'd0;      data_schedule[6][2] <= 16'd0;      data_schedule[6][3] <= matrix_a_33;

            // Weight scheduling pattern for Matrix B
            weight_schedule[0][0] <= matrix_b_00; weight_schedule[0][1] <= 8'd0;       weight_schedule[0][2] <= 8'd0;       weight_schedule[0][3] <= 8'd0;
            weight_schedule[1][0] <= matrix_b_10; weight_schedule[1][1] <= matrix_b_01; weight_schedule[1][2] <= 8'd0;       weight_schedule[1][3] <= 8'd0;
            weight_schedule[2][0] <= matrix_b_20; weight_schedule[2][1] <= matrix_b_11; weight_schedule[2][2] <= matrix_b_02; weight_schedule[2][3] <= 8'd0;
            weight_schedule[3][0] <= matrix_b_30; weight_schedule[3][1] <= matrix_b_21; weight_schedule[3][2] <= matrix_b_12; weight_schedule[3][3] <= matrix_b_03;
            weight_schedule[4][0] <= 8'd0;       weight_schedule[4][1] <= matrix_b_31; weight_schedule[4][2] <= matrix_b_22; weight_schedule[4][3] <= matrix_b_13;
            weight_schedule[5][0] <= 8'd0;       weight_schedule[5][1] <= 8'd0;       weight_schedule[5][2] <= matrix_b_32; weight_schedule[5][3] <= matrix_b_23;
            weight_schedule[6][0] <= 8'd0;       weight_schedule[6][1] <= 8'd0;       weight_schedule[6][2] <= 8'd0;       weight_schedule[6][3] <= matrix_b_33;

            // Valid signals
            data_valid_schedule[0][0] <= 1'b1; data_valid_schedule[0][1] <= 1'b0; data_valid_schedule[0][2] <= 1'b0; data_valid_schedule[0][3] <= 1'b0;
            data_valid_schedule[1][0] <= 1'b1; data_valid_schedule[1][1] <= 1'b1; data_valid_schedule[1][2] <= 1'b0; data_valid_schedule[1][3] <= 1'b0;
            data_valid_schedule[2][0] <= 1'b1; data_valid_schedule[2][1] <= 1'b1; data_valid_schedule[2][2] <= 1'b1; data_valid_schedule[2][3] <= 1'b0;
            data_valid_schedule[3][0] <= 1'b1; data_valid_schedule[3][1] <= 1'b1; data_valid_schedule[3][2] <= 1'b1; data_valid_schedule[3][3] <= 1'b1;
            data_valid_schedule[4][0] <= 1'b0; data_valid_schedule[4][1] <= 1'b1; data_valid_schedule[4][2] <= 1'b1; data_valid_schedule[4][3] <= 1'b1;
            data_valid_schedule[5][0] <= 1'b0; data_valid_schedule[5][1] <= 1'b0; data_valid_schedule[5][2] <= 1'b1; data_valid_schedule[5][3] <= 1'b1;
            data_valid_schedule[6][0] <= 1'b0; data_valid_schedule[6][1] <= 1'b0; data_valid_schedule[6][2] <= 1'b0; data_valid_schedule[6][3] <= 1'b1;

            weight_valid_schedule[0][0] <= 1'b1; weight_valid_schedule[0][1] <= 1'b0; weight_valid_schedule[0][2] <= 1'b0; weight_valid_schedule[0][3] <= 1'b0;
            weight_valid_schedule[1][0] <= 1'b1; weight_valid_schedule[1][1] <= 1'b1; weight_valid_schedule[1][2] <= 1'b0; weight_valid_schedule[1][3] <= 1'b0;
            weight_valid_schedule[2][0] <= 1'b1; weight_valid_schedule[2][1] <= 1'b1; weight_valid_schedule[2][2] <= 1'b1; weight_valid_schedule[2][3] <= 1'b0;
            weight_valid_schedule[3][0] <= 1'b1; weight_valid_schedule[3][1] <= 1'b1; weight_valid_schedule[3][2] <= 1'b1; weight_valid_schedule[3][3] <= 1'b1;
            weight_valid_schedule[4][0] <= 1'b0; weight_valid_schedule[4][1] <= 1'b1; weight_valid_schedule[4][2] <= 1'b1; weight_valid_schedule[4][3] <= 1'b1;
            weight_valid_schedule[5][0] <= 1'b0; weight_valid_schedule[5][1] <= 1'b0; weight_valid_schedule[5][2] <= 1'b1; weight_valid_schedule[5][3] <= 1'b1;
            weight_valid_schedule[6][0] <= 1'b0; weight_valid_schedule[6][1] <= 1'b0; weight_valid_schedule[6][2] <= 1'b0; weight_valid_schedule[6][3] <= 1'b1;
        end
    end

    // Current cycle data/weight selection
    always @(*) begin
        if (data_feed_enable && cycle_counter <= 4'd6) begin
            data_in_0 = data_schedule[cycle_counter][0];
            data_in_1 = data_schedule[cycle_counter][1];
            data_in_2 = data_schedule[cycle_counter][2];
            data_in_3 = data_schedule[cycle_counter][3];

            weight_in_0 = weight_schedule[cycle_counter][0];
            weight_in_1 = weight_schedule[cycle_counter][1];
            weight_in_2 = weight_schedule[cycle_counter][2];
            weight_in_3 = weight_schedule[cycle_counter][3];

            data_valid_0 = data_valid_schedule[cycle_counter][0];
            data_valid_1 = data_valid_schedule[cycle_counter][1];
            data_valid_2 = data_valid_schedule[cycle_counter][2];
            data_valid_3 = data_valid_schedule[cycle_counter][3];

            weight_valid_0 = weight_valid_schedule[cycle_counter][0];
            weight_valid_1 = weight_valid_schedule[cycle_counter][1];
            weight_valid_2 = weight_valid_schedule[cycle_counter][2];
            weight_valid_3 = weight_valid_schedule[cycle_counter][3];
        end else begin
            data_in_0 = 16'd0;    data_in_1 = 16'd0;    data_in_2 = 16'd0;    data_in_3 = 16'd0;
            weight_in_0 = 8'd0;   weight_in_1 = 8'd0;   weight_in_2 = 8'd0;   weight_in_3 = 8'd0;
            data_valid_0 = 1'b0;  data_valid_1 = 1'b0;  data_valid_2 = 1'b0;  data_valid_3 = 1'b0;
            weight_valid_0 = 1'b0; weight_valid_1 = 1'b0; weight_valid_2 = 1'b0; weight_valid_3 = 1'b0;
        end
    end

    // ===========================================
    // 4×4 Processing Element Array
    // ===========================================

    // Row 0
    processing_element pe_00 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_in_0), .data_valid_in(data_valid_0),
        .data_out(data_h_0_1), .data_valid_out(data_valid_h_0_1),
        .weight_in(weight_in_0), .weight_valid_in(weight_valid_0),
        .weight_out(weight_v_1_0), .weight_valid_out(weight_valid_v_1_0),
        .accum_out(result_00), .result_valid(valid_00)
    );

    processing_element pe_01 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_0_1), .data_valid_in(data_valid_h_0_1),
        .data_out(data_h_0_2), .data_valid_out(data_valid_h_0_2),
        .weight_in(weight_in_1), .weight_valid_in(weight_valid_1),
        .weight_out(weight_v_1_1), .weight_valid_out(weight_valid_v_1_1),
        .accum_out(result_01), .result_valid(valid_01)
    );

    processing_element pe_02 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_0_2), .data_valid_in(data_valid_h_0_2),
        .data_out(data_h_0_3), .data_valid_out(data_valid_h_0_3),
        .weight_in(weight_in_2), .weight_valid_in(weight_valid_2),
        .weight_out(weight_v_1_2), .weight_valid_out(weight_valid_v_1_2),
        .accum_out(result_02), .result_valid(valid_02)
    );

    processing_element pe_03 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_0_3), .data_valid_in(data_valid_h_0_3),
        .data_out(data_h_0_4), .data_valid_out(data_valid_h_0_4),
        .weight_in(weight_in_3), .weight_valid_in(weight_valid_3),
        .weight_out(weight_v_1_3), .weight_valid_out(weight_valid_v_1_3),
        .accum_out(result_03), .result_valid(valid_03)
    );

    // Row 1
    processing_element pe_10 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_in_1), .data_valid_in(data_valid_1),
        .data_out(data_h_1_1), .data_valid_out(data_valid_h_1_1),
        .weight_in(weight_v_1_0), .weight_valid_in(weight_valid_v_1_0),
        .weight_out(weight_v_2_0), .weight_valid_out(weight_valid_v_2_0),
        .accum_out(result_10), .result_valid(valid_10)
    );

    processing_element pe_11 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_1_1), .data_valid_in(data_valid_h_1_1),
        .data_out(data_h_1_2), .data_valid_out(data_valid_h_1_2),
        .weight_in(weight_v_1_1), .weight_valid_in(weight_valid_v_1_1),
        .weight_out(weight_v_2_1), .weight_valid_out(weight_valid_v_2_1),
        .accum_out(result_11), .result_valid(valid_11)
    );

    processing_element pe_12 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_1_2), .data_valid_in(data_valid_h_1_2),
        .data_out(data_h_1_3), .data_valid_out(data_valid_h_1_3),
        .weight_in(weight_v_1_2), .weight_valid_in(weight_valid_v_1_2),
        .weight_out(weight_v_2_2), .weight_valid_out(weight_valid_v_2_2),
        .accum_out(result_12), .result_valid(valid_12)
    );

    processing_element pe_13 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_1_3), .data_valid_in(data_valid_h_1_3),
        .data_out(data_h_1_4), .data_valid_out(data_valid_h_1_4),
        .weight_in(weight_v_1_3), .weight_valid_in(weight_valid_v_1_3),
        .weight_out(weight_v_2_3), .weight_valid_out(weight_valid_v_2_3),
        .accum_out(result_13), .result_valid(valid_13)
    );

    // Row 2
    processing_element pe_20 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_in_2), .data_valid_in(data_valid_2),
        .data_out(data_h_2_1), .data_valid_out(data_valid_h_2_1),
        .weight_in(weight_v_2_0), .weight_valid_in(weight_valid_v_2_0),
        .weight_out(weight_v_3_0), .weight_valid_out(weight_valid_v_3_0),
        .accum_out(result_20), .result_valid(valid_20)
    );

    processing_element pe_21 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_2_1), .data_valid_in(data_valid_h_2_1),
        .data_out(data_h_2_2), .data_valid_out(data_valid_h_2_2),
        .weight_in(weight_v_2_1), .weight_valid_in(weight_valid_v_2_1),
        .weight_out(weight_v_3_1), .weight_valid_out(weight_valid_v_3_1),
        .accum_out(result_21), .result_valid(valid_21)
    );

    processing_element pe_22 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_2_2), .data_valid_in(data_valid_h_2_2),
        .data_out(data_h_2_3), .data_valid_out(data_valid_h_2_3),
        .weight_in(weight_v_2_2), .weight_valid_in(weight_valid_v_2_2),
        .weight_out(weight_v_3_2), .weight_valid_out(weight_valid_v_3_2),
        .accum_out(result_22), .result_valid(valid_22)
    );

    processing_element pe_23 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_2_3), .data_valid_in(data_valid_h_2_3),
        .data_out(data_h_2_4), .data_valid_out(data_valid_h_2_4),
        .weight_in(weight_v_2_3), .weight_valid_in(weight_valid_v_2_3),
        .weight_out(weight_v_3_3), .weight_valid_out(weight_valid_v_3_3),
        .accum_out(result_23), .result_valid(valid_23)
    );

    // Row 3
    processing_element pe_30 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_in_3), .data_valid_in(data_valid_3),
        .data_out(data_h_3_1), .data_valid_out(data_valid_h_3_1),
        .weight_in(weight_v_3_0), .weight_valid_in(weight_valid_v_3_0),
        .weight_out(weight_v_4_0), .weight_valid_out(weight_valid_v_4_0),
        .accum_out(result_30), .result_valid(valid_30)
    );

    processing_element pe_31 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_3_1), .data_valid_in(data_valid_h_3_1),
        .data_out(data_h_3_2), .data_valid_out(data_valid_h_3_2),
        .weight_in(weight_v_3_1), .weight_valid_in(weight_valid_v_3_1),
        .weight_out(weight_v_4_1), .weight_valid_out(weight_valid_v_4_1),
        .accum_out(result_31), .result_valid(valid_31)
    );

    processing_element pe_32 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_3_2), .data_valid_in(data_valid_h_3_2),
        .data_out(data_h_3_3), .data_valid_out(data_valid_h_3_3),
        .weight_in(weight_v_3_2), .weight_valid_in(weight_valid_v_3_2),
        .weight_out(weight_v_4_2), .weight_valid_out(weight_valid_v_4_2),
        .accum_out(result_32), .result_valid(valid_32)
    );

    processing_element pe_33 (
        .clk(clk), .rst_n(rst_n), .enable(enable_pe), .clear_accum(clear_accum_pe),
        .data_in(data_h_3_3), .data_valid_in(data_valid_h_3_3),
        .data_out(data_h_3_4), .data_valid_out(data_valid_h_3_4),
        .weight_in(weight_v_3_3), .weight_valid_in(weight_valid_v_3_3),
        .weight_out(weight_v_4_3), .weight_valid_out(weight_valid_v_4_3),
        .accum_out(result_33), .result_valid(valid_33)
    );

    // ===========================================
    // Output Control
    // ===========================================
    assign computation_done = (current_state == DONE);
    assign result_valid = (current_state == DONE);

endmodule

// ===========================================
// Processing Element
// ===========================================
module processing_element #(
    parameter DATA_WIDTH = 16,
    parameter WEIGHT_WIDTH = 8,
    parameter ACCUM_WIDTH = 32
)(
    input                         clk,
    input                         rst_n,
    input                         enable,
    input                         clear_accum,

    input  [DATA_WIDTH-1:0]       data_in,
    input                         data_valid_in,
    output [DATA_WIDTH-1:0]       data_out,
    output                        data_valid_out,

    input  [WEIGHT_WIDTH-1:0]     weight_in,
    input                         weight_valid_in,
    output [WEIGHT_WIDTH-1:0]     weight_out,
    output                        weight_valid_out,

    output [ACCUM_WIDTH-1:0]      accum_out,
    output                        result_valid
);

    wire [DATA_WIDTH-1:0]      mac_data;
    wire [WEIGHT_WIDTH-1:0]    mac_weight;
    wire [ACCUM_WIDTH-1:0]     mac_accum;
    wire                       mac_valid;

    reg [DATA_WIDTH-1:0]       data_reg;
    reg                        data_valid_reg;
    reg [WEIGHT_WIDTH-1:0]     weight_reg;
    reg                        weight_valid_reg;

    // MAC unit gets valid data/weight only when both are valid
    assign mac_data = (data_valid_in && weight_valid_in) ? data_in : {DATA_WIDTH{1'b0}};
    assign mac_weight = (data_valid_in && weight_valid_in) ? weight_in : {WEIGHT_WIDTH{1'b0}};

    // Register data and weights for systolic flow
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            data_reg <= {DATA_WIDTH{1'b0}};
            data_valid_reg <= 1'b0;
            weight_reg <= {WEIGHT_WIDTH{1'b0}};
            weight_valid_reg <= 1'b0;
        end else if (enable) begin
            data_reg <= data_in;
            data_valid_reg <= data_valid_in;
            weight_reg <= weight_in;
            weight_valid_reg <= weight_valid_in;
        end else begin
            data_valid_reg <= 1'b0;
            weight_valid_reg <= 1'b0;
        end
    end

    // Output registered values for systolic flow
    assign data_out = data_reg;
    assign data_valid_out = data_valid_reg;
    assign weight_out = weight_reg;
    assign weight_valid_out = weight_valid_reg;

    // MAC unit instantiation
    mac_unit_basic #(
        .DATA_WIDTH(DATA_WIDTH),
        .WEIGHT_WIDTH(WEIGHT_WIDTH),
        .ACCUM_WIDTH(ACCUM_WIDTH)
    ) mac_unit (
        .clk(clk),
        .rst_n(rst_n),
        .enable(enable && data_valid_in && weight_valid_in),
        .clear_accum(clear_accum),
        .data_in(mac_data),
        .weight_in(mac_weight),
        .accum_out(mac_accum),
        .valid_out(mac_valid)
    );

    assign accum_out = mac_accum;
    assign result_valid = mac_valid;

endmodule

// ===========================================
// MAC Unit
// ===========================================
module mac_unit_basic #(
    parameter DATA_WIDTH = 16,
    parameter WEIGHT_WIDTH = 8,
    parameter ACCUM_WIDTH = 32
)(
    input                         clk,
    input                         rst_n,
    input                         enable,
    input                         clear_accum,

    input  [DATA_WIDTH-1:0]       data_in,
    input  [WEIGHT_WIDTH-1:0]     weight_in,

    output [ACCUM_WIDTH-1:0]      accum_out,
    output                        valid_out
);

    wire signed [DATA_WIDTH-1:0]      data_signed;
    wire signed [WEIGHT_WIDTH-1:0]    weight_signed;
    wire signed [DATA_WIDTH+WEIGHT_WIDTH-1:0] mult_result;
    reg signed [ACCUM_WIDTH-1:0]      accum_reg;
    wire signed [ACCUM_WIDTH-1:0]     next_accum;
    reg                               valid_out_reg;

    // Convert to signed for arithmetic
    assign data_signed = $signed(data_in);
    assign weight_signed = $signed(weight_in);
    assign mult_result = data_signed * weight_signed;

    // Accumulator logic: clear or accumulate
    assign next_accum = clear_accum ? mult_result : (accum_reg + mult_result);

    // Accumulator register
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accum_reg <= {ACCUM_WIDTH{1'b0}};
            valid_out_reg <= 1'b0;
        end else if (enable) begin
            accum_reg <= next_accum;
            valid_out_reg <= 1'b1;
        end else begin
            valid_out_reg <= 1'b0;
        end
    end

    assign accum_out = accum_reg;
    assign valid_out = valid_out_reg;

endmodule


### Setting up the configuration

OpenLane requires you to configure any Flow before using it. This is done using
the `config` module.

For colaboratories, REPLs and other interactive environments where there is no
concrete Flow object, the Configuration may be initialized using `Config.interactive`,
which will automatically propagate the configuration to any future steps.

You can find the documentation for `Config.interactive` [here](https://openlane2.readthedocs.io/en/latest/reference/api/config/index.html#openlane.config.Config.interactive).



In [None]:
from openlane.config import Config

Config.interactive(
    "spm",
    PDK="sky130A",
    CLOCK_PORT="clk",
    CLOCK_NET="clk",
    CLOCK_PERIOD=10,
    PRIMARY_GDSII_STREAMOUT_TOOL="klayout",
)
{
  "DESIGN_NAME": "systolic_array_4x4",
  "VERILOG_FILES": ["dir::systolic_array_4x4.v"],
  "CLOCK_PERIOD": 50,
  "CLOCK_PORT": "clk"
}

### Running implementation steps

There are two ways to obtain OpenLane's built-in implementation steps:

* via directly importing from the `steps` module using its category:
    * `from openlane.steps import Yosys` then `Synthesis = Yosys.Synthesis`
* by using the step's id from the registry:
    * `from openlane.steps import Step` then `Synthesis = Step.factory.get("Yosys.Synthesis")`

You can find a full list of included steps here: https://openlane2.readthedocs.io/en/latest/reference/step_config_vars.html
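
The id-based lookup can be pictured as a plain dictionary from strings to classes. The sketch below is a toy model only, not OpenLane's code: `ToyStep`, `ToyFactory`, and `ToySynthesis` are hypothetical names made up for this illustration.

```python
# Toy id-to-class registry (hypothetical names, not OpenLane's implementation).
from typing import Callable, Dict, List, Type


class ToyStep:
    """Stand-in base class for a flow step."""

    id: str = "Toy.Base"


class ToyFactory:
    """Maps string ids such as "Yosys.Synthesis" to step classes."""

    _registry: Dict[str, Type[ToyStep]] = {}

    @classmethod
    def register(cls) -> Callable[[Type[ToyStep]], Type[ToyStep]]:
        # Used as a class decorator: stores the class under its `id`.
        def decorator(step_cls: Type[ToyStep]) -> Type[ToyStep]:
            cls._registry[step_cls.id] = step_cls
            return step_cls

        return decorator

    @classmethod
    def get(cls, step_id: str) -> Type[ToyStep]:
        return cls._registry[step_id]

    @classmethod
    def ids(cls) -> List[str]:
        return sorted(cls._registry)


@ToyFactory.register()
class ToySynthesis(ToyStep):
    id = "Yosys.Synthesis"


looked_up = ToyFactory.get("Yosys.Synthesis")
print(looked_up is ToySynthesis)
```

Either route ends with the same class object in hand, so which one you use is mostly a matter of taste.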

In [None]:
from openlane.steps import Step

* First, get the step (and display its help)...

In [None]:
Synthesis = Step.factory.get("Yosys.Synthesis")

Synthesis.display_help()

* Then run it. Note you can pass step-specific configs using Python keyword
  arguments.

### Synthesis

We need to start by converting our high-level Verilog to one that just shows
the connections between small silicon patterns called "standard cells" in a
process called Synthesis. We can do this by passing the Verilog files as a
configuration variable to `Yosys.Synthesis` as follows, then running it.

As this is the first step, we need to create an empty state and pass it to it.
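
Every step in this notebook follows the same pattern: it takes the previous step's output state as `state_in` and produces a new state. As a rough mental model (a toy sketch, not OpenLane's `State` class; `run_step` and the keys `nl` and `def` are hypothetical), the chaining looks like this:

```python
# Toy model of state passing between steps (hypothetical, not OpenLane code).
from types import MappingProxyType


def run_step(name, state_in, produces):
    """Return a new read-only state; the input state is left untouched."""
    new_state = dict(state_in)
    new_state[produces] = f"{name}.{produces}"
    return MappingProxyType(new_state)


empty = MappingProxyType({})  # plays the role of the initial empty state
after_synthesis = run_step("synthesis", empty, "nl")
after_floorplan = run_step("floorplan", after_synthesis, "def")

print(dict(empty))
print(dict(after_floorplan))
```

Because no step modifies its input, you can re-run a later step with different settings without redoing the earlier ones.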

In [None]:
from openlane.state import State

synthesis = Synthesis(
    VERILOG_FILES=["./spm.v"],
    state_in=State(),
)
synthesis.start()

In [None]:
display(synthesis)

### Floorplanning

Floorplanning does two things:

* Determines the dimensions of the final chip.
* Creates the "cell placement grid" which placed cells must be aligned to.
    * Each cell in the grid is called a "site." Cells can occupy multiple
      sites, with the overwhelming majority of cells occupying multiple sites
      by width, and some standard cell libraries supporting varying heights as well.
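
To make the idea of a site grid concrete, here is a small sketch that counts how many sites and rows fit in a core area and how many sites a cell covers. The site size used here is an assumption for illustration; the real value comes from the PDK's technology files.

```python
# Assumed site size in nanometres (integers avoid floating point surprises).
SITE_WIDTH_NM = 460
SITE_HEIGHT_NM = 2720


def grid_dimensions(core_width_nm, core_height_nm):
    """Return (sites per row, number of rows) that fit inside the core area."""
    return core_width_nm // SITE_WIDTH_NM, core_height_nm // SITE_HEIGHT_NM


def sites_occupied(cell_width_nm):
    """Number of sites a cell of the given width covers, rounded up."""
    return -(-cell_width_nm // SITE_WIDTH_NM)


print(grid_dimensions(46_000, 27_200))  # a hypothetical 46 um x 27.2 um core
print(sites_occupied(1_380))            # a hypothetical 1.38 um wide cell
```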

> Don't forget: you may call `display_help()` on any Step class to get a full
> list of configuration variables.


In [None]:
Floorplan = Step.factory.get("OpenROAD.Floorplan")

floorplan = Floorplan(state_in=synthesis.state_out)
floorplan.start()

In [None]:
display(floorplan)

### Tap/Endcap Cell Insertion

This places two kinds of cells on the floorplan:

* End cap/boundary cells: Added at the beginning and end of each row. True to
  their name, they "cap off" the core area of a design.
* Tap cells: Placed in a polka dot-ish fashion across the rows. Tap cells
  connect VDD to the nwell and the psubstrate to VSS, which the majority of cells
  do not do themselves to save area. If you go long enough without one such
  connection, however, you end up with the cell "latching up", i.e., refusing to
  switch back to LO from HI.

  There is a maximum distance between tap cells enforced as part of every
  foundry process.
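
As a one-dimensional illustration of the maximum-distance rule, the sketch below spaces taps evenly along a single row so that no gap exceeds the limit. The row length and the limit are made-up numbers, and the actual step also staggers taps across rows, which this sketch ignores.

```python
import math


def tap_positions(row_length_um, max_distance_um):
    """Evenly spaced tap x-positions with no gap larger than max_distance_um."""
    gaps = max(1, math.ceil(row_length_um / max_distance_um))
    return [i * row_length_um / gaps for i in range(gaps + 1)]


# Hypothetical 50 um row with a hypothetical 13 um maximum tap distance.
positions = tap_positions(50, 13)
print(positions)
```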

In [None]:
TapEndcapInsertion = Step.factory.get("OpenROAD.TapEndcapInsertion")

tdi = TapEndcapInsertion(state_in=floorplan.state_out)
tdi.start()

In [None]:
display(tdi)

### I/O Placement

This places metal pins at the edges of the design corresponding to the top-level
inputs and outputs of your design. These pins act as the interface to the outside
when you integrate your design with other designs.

In [None]:
IOPlacement = Step.factory.get("OpenROAD.IOPlacement")

ioplace = IOPlacement(state_in=tdi.state_out)
ioplace.start()

In [None]:
display(ioplace)

### Generating the Power Distribution Network (PDN)

This creates the power distribution network for your design, which is essentially
a plaid pattern of horizontal and vertical "straps" across the design that is
then connected to the rails' VDD and VSS (via the tap cells.)

You can find an explanation of how the power distribution network works at this
link: https://openlane2.readthedocs.io/en/latest/usage/hardening_macros.html#pdn-generation

While we typically don't need to mess with the PDN too much, the SPM is a small
design, so we're going to need to make the plaid pattern formed by the PDN a bit
smaller.
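
To see why the pitch matters for a small design, the sketch below counts how many vertical straps land inside a core of a given width. The 100 µm core width, the 5 µm offset, and the 150 µm "large" pitch are hypothetical values for illustration; only the pitch of 30 matches the value used in the cell below.

```python
def strap_centers(core_width_um, pitch_um, offset_um):
    """X coordinates of vertical strap centres that fall inside the core."""
    centers = []
    x = offset_um
    while x <= core_width_um:
        centers.append(x)
        x += pitch_um
    return centers


small_pitch = strap_centers(100, 30, 5)
large_pitch = strap_centers(100, 150, 5)
print(len(small_pitch), small_pitch)
print(len(large_pitch), large_pitch)
```

If straps alternate between power and ground, a single strap would leave one of the two nets without any strap at all, which is why a tiny design wants a tighter pattern.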

In [None]:
GeneratePDN = Step.factory.get("OpenROAD.GeneratePDN")

pdn = GeneratePDN(
    state_in=ioplace.state_out,
    FP_PDN_VWIDTH=2,
    FP_PDN_HWIDTH=2,
    FP_PDN_VPITCH=30,
    FP_PDN_HPITCH=30,
)
pdn.start()

In [None]:
display(pdn)

### Global Placement

Global Placement is deciding on a fuzzy, non-final location for each of the cells,
with the aim of minimizing the distance between cells that are connected
together (more specifically, the total length of the not-yet-created wires that
will connect them).

As you will see in the `display()` output in the second cell below, the placement is
considered "illegal", i.e., not properly aligned with the cell placement grid.
This is addressed by "Detailed Placement", also referred to as "placement
legalization", which is the next step.
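
Since the wires do not exist yet, placers work with an estimate of their length. A commonly used estimate is the half-perimeter wirelength (HPWL): half the perimeter of the bounding box around a net's pins. The sketch below computes it for two made-up nets; it only illustrates the metric and says nothing about how the placer optimizes it internally.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Hypothetical nets with pin coordinates in arbitrary units.
nets = {
    "n1": [(0, 0), (3, 4)],
    "n2": [(1, 1), (4, 1), (2, 5)],
}
total_wirelength = sum(hpwl(pins) for pins in nets.values())
print(total_wirelength)
```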

In [None]:
GlobalPlacement = Step.factory.get("OpenROAD.GlobalPlacement")

gpl = GlobalPlacement(state_in=pdn.state_out)
gpl.start()

In [None]:
display(gpl)

### Detailed Placement

This aligns the fuzzy placement from before with the grid, "legalizing" it.
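
A very simplified picture of legalization for a single row: snap each cell to the nearest site, then push it right if it would overlap the previous cell. This greedy pass is only a sketch with made-up cells and an assumed site width; real detailed placers are considerably more sophisticated.

```python
SITE_WIDTH_NM = 460  # assumed site width


def legalize_row(cells):
    """Greedy legalization of one row.

    cells: list of (name, x_nm, width_in_sites) from global placement.
    Returns a dict mapping each cell name to its starting site index.
    """
    placed = {}
    next_free_site = 0
    for name, x_nm, width in sorted(cells, key=lambda cell: cell[1]):
        snapped = round(x_nm / SITE_WIDTH_NM)  # nearest site
        site = max(snapped, next_free_site)    # resolve overlap by shifting right
        placed[name] = site
        next_free_site = site + width
    return placed


legal = legalize_row([("a", 100, 2), ("b", 500, 3), ("c", 2500, 1)])
print(legal)
```

Here cell `b` would snap to site 1 but overlaps `a`, so it is pushed to site 2.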

In [None]:
DetailedPlacement = Step.factory.get("OpenROAD.DetailedPlacement")

dpl = DetailedPlacement(state_in=gpl.state_out)
dpl.start()

In [None]:
display(dpl)

### Clock Tree Synthesis (CTS)

With the cells now having a final placement, we can go ahead and create what
is known as the clock tree, i.e., the hierarchical set of buffers used to
distribute the clock signal while minimizing what is known as "clock skew":
variation in the clock's arrival time from register to register because of
factors such as metal wire length, clock load (the number of gates connected to
the same clock buffer), et cetera.

The CTS step creates the buffer cells and places them in the gaps left by the
detailed placement above.
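
Skew itself is simple to state: the difference between the latest and earliest clock arrival among the registers. The arrival times below are invented purely to show the calculation.

```python
def clock_skew(arrivals_ps):
    """Difference between the latest and earliest clock arrival, in ps."""
    return max(arrivals_ps.values()) - min(arrivals_ps.values())


# Hypothetical clock arrival times at three flip-flops, in picoseconds.
without_tree = {"ff1": 120, "ff2": 340, "ff3": 510}
with_tree = {"ff1": 400, "ff2": 410, "ff3": 395}

print(clock_skew(without_tree))
print(clock_skew(with_tree))
```

Note that in this made-up example the balanced tree has a larger absolute delay but a far smaller spread, and it is the spread that matters for skew.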

In [None]:
CTS = Step.factory.get("OpenROAD.CTS")

cts = CTS(state_in=dpl.state_out)
cts.start()

In [None]:
display(cts)

### Global Routing

Global routing "plans" the routes the wires between two gates (or gates and
I/O pins/the PDN) will take. The results of global routing (which are called
"routing guides") are stored in internal data structures and have no effect on
the actual design, so there is no `display()` statement.
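
One way to picture a routing guide is as a list of coarse grid cells a wire is allowed to pass through. The sketch below produces a simple L-shaped guide between two grid cells. It is an illustration of the concept only; the actual global router also accounts for congestion and layer assignment.

```python
def l_shaped_guide(src, dst):
    """Grid cells from src to dst, going horizontally first, then vertically."""
    (x0, y0), (x1, y1) = src, dst
    step_x = 1 if x1 >= x0 else -1
    step_y = 1 if y1 >= y0 else -1
    cells = [(x, y0) for x in range(x0, x1 + step_x, step_x)]
    cells += [(x1, y) for y in range(y0 + step_y, y1 + step_y, step_y)]
    return cells


guide = l_shaped_guide((0, 0), (2, 1))
print(guide)
```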

In [None]:
GlobalRouting = Step.factory.get("OpenROAD.GlobalRouting")

grt = GlobalRouting(state_in=cts.state_out)
grt.start()

### Detailed Routing

Detailed routing uses the guides from Global Routing to actually create wires
on the metal layers and connect the gates, making the connections finally physical.

This is typically the longest step in the flow.

In [None]:
DetailedRouting = Step.factory.get("OpenROAD.DetailedRouting")

drt = DetailedRouting(state_in=grt.state_out)
drt.start()

In [None]:
display(drt)

### Fill Insertion

Finally, as we're done placing all the essential cells, the only thing left to
do is fill in the gaps.

We prioritize the use of decap (decoupling capacitor) cells, which
further support the power distribution network, but when no decap cell is
small enough to fit a gap, we just use regular fill cells.
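
The "decap first, fill for the rest" policy can be sketched as a greedy loop over one gap. The cell widths below are hypothetical and measured in sites; the real widths depend on the standard cell library.

```python
# Hypothetical cell widths in sites, largest first.
DECAP_WIDTHS = [12, 8, 6, 4, 3]
FILL_WIDTHS = [8, 4, 2, 1]


def fill_gap(gap_sites):
    """Fill a gap greedily: decap cells where they fit, fill cells otherwise."""
    cells = []
    remaining = gap_sites
    for kind, widths in (("decap", DECAP_WIDTHS), ("fill", FILL_WIDTHS)):
        for width in widths:
            while remaining >= width:
                cells.append((kind, width))
                remaining -= width
    return cells


print(fill_gap(14))
print(fill_gap(2))
```

Since a one-site fill cell exists in this sketch, any gap can always be closed completely.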

In [None]:
FillInsertion = Step.factory.get("OpenROAD.FillInsertion")

fill = FillInsertion(state_in=drt.state_out)
fill.start()

In [None]:
display(fill)

### Parasitics Extraction a.k.a. Resistance/Capacitance Extraction (RCX)

This step does not alter the design. Rather, it computes the
[Parasitic elements](https://en.wikipedia.org/wiki/Parasitic_element_(electrical_networks))
of the circuit, which have an effect on timing, as we prepare to do the final
timing analysis.

The parasitic elements are saved in the **Standard Parasitics Exchange Format**,
or SPEF. OpenLane creates a SPEF file for each interconnect corner as described in
the [Corners and STA](https://openlane2.readthedocs.io/en/latest/usage/corners_and_sta.html)
section of the documentation.
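
To get a feel for why parasitics affect timing, the sketch below applies the Elmore delay, a classic first-order estimate, to a wire modelled as a chain of resistor/capacitor segments. The segment values are invented, and timing tools may use more detailed delay models than this.

```python
def elmore_delay(segments):
    """Elmore delay from driver to the far end of an RC ladder.

    segments: list of (resistance_ohm, capacitance_fF), ordered driver to sink.
    Each resistor charges all capacitance downstream of it.
    Returns the delay in femtoseconds (ohm * fF = fs).
    """
    delay = 0.0
    downstream_c = sum(c for _, c in segments)
    for r, c in segments:
        delay += r * downstream_c
        downstream_c -= c
    return delay


short_wire = [(100, 2)] * 2  # hypothetical: two segments of 100 ohm, 2 fF
long_wire = [(100, 2)] * 4   # the same wire, twice as long

print(elmore_delay(short_wire))
print(elmore_delay(long_wire))
```

Doubling the wire length more than triples the delay in this example, which is why long wires are a timing concern.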

In [None]:
RCX = Step.factory.get("OpenROAD.RCX")

rcx = RCX(state_in=fill.state_out)
rcx.start()

### Static Timing Analysis (Post-PnR)

STA is a process that verifies that a chip meets certain constraints on clock
and data timings to run at its rated clock speed. See [Corners and STA](https://openlane2.readthedocs.io/en/latest/usage/corners_and_sta.html)
in the documentation for more info.
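
The core setup check can be written in a few lines: data launched on one clock edge must arrive, with setup margin, before the next edge. The sketch below uses the 10 ns clock period passed to `Config.interactive` earlier; the cell delays are hypothetical numbers, and real STA repeats this over every path and corner.

```python
def setup_slack(clock_period, clk_to_q, logic_delay, setup_time, skew=0.0):
    """Setup slack in the same time unit as the inputs (here: ns).

    Positive slack means the path meets timing; negative means a violation.
    """
    arrival = clk_to_q + logic_delay
    required = clock_period + skew - setup_time
    return required - arrival


# Hypothetical path delays in nanoseconds.
passing = setup_slack(clock_period=10, clk_to_q=0.5, logic_delay=7.25, setup_time=0.25)
failing = setup_slack(clock_period=10, clk_to_q=0.5, logic_delay=9.5, setup_time=0.25)

print(passing)
print(failing)
```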

---

This step generates two kinds of files:
* `.lib`: Liberty™-compatible Library files. Can be used to do static timing
  analysis when creating a design with this design as a sub-macro.
* `.sdf`: Standard Delay Format. Can be used with certain simulation software
  to do *dynamic* timing analysis.

Unfortunately, the `.lib` files coming out of OpenLane right now are not super
reliable for timing purposes and are only provided for completeness.
When using OpenLane-created macros within other designs, it is best to use the
macro's final netlist and extracted parasitics instead.

In [None]:
STAPostPNR = Step.factory.get("OpenROAD.STAPostPNR")

sta_post_pnr = STAPostPNR(state_in=rcx.state_out)
sta_post_pnr.start()

### Stream-out

Stream-out is the process of converting the designs from the abstract formats
used during floorplanning, placement and routing into a concrete format called
GDSII (lit. Graphic Design System 2), which is the final file that is then sent
for fabrication.

In [None]:
StreamOut = Step.factory.get("KLayout.StreamOut")

gds = StreamOut(state_in=sta_post_pnr.state_out)
gds.start()

In [None]:
display(gds)

### Design Rule Checks (DRC)

DRC checks that the final layout does not violate any of the rules set by
the foundry to ensure the design is actually manufacturable. Examples of
violations include not enough space between two wires, *too much* space between
tap cells, and so on.

A design not passing DRC will typically be rejected by the foundry, who
also run DRC on their side.
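
A minimum-spacing rule is one of the easier checks to picture. The sketch below measures the distance between axis-aligned rectangles and reports pairs that are too close. The shapes and the minimum spacing are made up, every pair is treated as belonging to different nets, and real rule decks contain many more rule types than this.

```python
import math
from itertools import combinations


def spacing(a, b):
    """Distance between two rectangles given as (x0, y0, x1, y1); 0 if they touch."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)


def spacing_violations(shapes, min_spacing):
    """Pairs of shape names that are closer together than min_spacing."""
    return [
        (name_a, name_b)
        for (name_a, rect_a), (name_b, rect_b) in combinations(shapes.items(), 2)
        if spacing(rect_a, rect_b) < min_spacing
    ]


# Hypothetical wires on one metal layer, in arbitrary units.
wires = {
    "w1": (0, 0, 10, 1),
    "w2": (0, 3, 10, 4),
    "w3": (13, 0, 20, 1),
}
violations = spacing_violations(wires, min_spacing=3)
print(violations)
```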

In [None]:
DRC = Step.factory.get("Magic.DRC")

drc = DRC(state_in=gds.state_out)
drc.start()

### SPICE Extraction for Layout vs. Schematic Check

This step tries to reconstruct a SPICE netlist from the GDSII file, so it can
later be used for the **Layout vs. Schematic** (LVS) check.

In [None]:
SpiceExtraction = Step.factory.get("Magic.SpiceExtraction")

spx = SpiceExtraction(state_in=drc.state_out)
spx.start()

### Layout vs. Schematic (LVS)

A comparison between the final Verilog netlist (from PnR) and the final
SPICE netlist (extracted.)

This check effectively compares the physically implemented circuit to the final
Verilog netlist output by OpenROAD.

The idea is, if there are any disconnects, shorts or other mismatches in the
physical implementation that do not exist in the logical view of the design,
they would be caught at this step.

Common issues that result in LVS violations include:
* Lack of fill cells or tap cells in the design
* Two unrelated signals being shorted, or a wire being disconnected (most
  commonly seen with a misconfigured PDN)

Chips with LVS errors are typically dead on arrival.
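
At its simplest, the comparison asks whether the same pins are connected together in both views, regardless of what the nets are called. The sketch below is a heavily simplified stand-in for what an LVS tool does: the netlists are made up, instance names are assumed to match between the two views, and device types are ignored.

```python
def connectivity(netlist):
    """Reduce a netlist to its groups of connected pins, ignoring net names."""
    return {frozenset(pins) for pins in netlist.values()}


def lvs_match(schematic, layout):
    """True if both netlists connect exactly the same groups of pins."""
    return connectivity(schematic) == connectivity(layout)


# Hypothetical netlists: net name -> set of (instance, pin).
schematic = {
    "a": {("u1", "Y"), ("u2", "A")},
    "b": {("u2", "Y"), ("u3", "A")},
}
layout_clean = {
    "net_1": {("u1", "Y"), ("u2", "A")},
    "net_2": {("u2", "Y"), ("u3", "A")},
}
layout_shorted = {
    # Nets "a" and "b" have been shorted into a single net.
    "net_1": {("u1", "Y"), ("u2", "A"), ("u2", "Y"), ("u3", "A")},
}

print(lvs_match(schematic, layout_clean))
print(lvs_match(schematic, layout_shorted))
```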

In [None]:
LVS = Step.factory.get("Netgen.LVS")

lvs = LVS(state_in=spx.state_out)
lvs.start()