Softmax layer latency #20

Closed · jmduarte opened this issue Nov 17, 2017 · 2 comments

@jmduarte (Member) commented Nov 17, 2017:

Using the branch nt/resource-reuse-api, I checked what the latency and resource usage are for the 3-layer model with two ReuseFactor test cases (below). In either case the softmax layer takes 34 clock cycles, and I was going to check the code to see whether this is expected.
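
For context, a minimal sketch of how a reuse factor is typically realized in HLS (my illustration, assuming the usual pipelining scheme, not the actual code on nt/resource-reuse-api): the layer is pipelined with an initiation interval equal to the reuse factor while its loops are unrolled, so each multiplier can be time-shared across that many cycles. That is consistent with the reports below, where the compute_layer intervals grow from 1 to 3-4 as ReuseFactor goes from 1 to 4, while the softmax latency stays at 34 in both cases.

```cpp
// Sketch only: a dense layer where II = ReuseFactor trades interval for
// multiplier (DSP) count. With II=1 every multiply gets its own DSP; with
// II=4 the scheduler can time-share each DSP across 4 cycles.
void dense_reuse(const float in[16], const float w[16][16], float out[16]) {
    #pragma HLS PIPELINE II=4  // II = ReuseFactor (assumed mapping)
    for (int j = 0; j < 16; ++j) {
        #pragma HLS UNROLL
        float acc = 0.0f;
        for (int i = 0; i < 16; ++i) {
            #pragma HLS UNROLL
            acc += in[i] * w[i][j];  // one MAC per weight
        }
        out[j] = acc;
    }
}
```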

ReuseFactor: 1

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   59|   59|    1|    1| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    5|    5|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_508        |compute_layer_0_0_0_1  |    4|    4|    1|    1| function |
        |grp_compute_layer_0_0_0_3_fu_539        |compute_layer_0_0_0_3  |    4|    4|    1|    1| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_s_fu_587        |compute_layer_0_0_0_s  |    3|    3|    1|    1| function |
        |call_ret2_relu_2_fu_623                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret4_relu_1_fu_691                 |relu_1                 |    0|    0|    1|    1| function |
        |call_ret_relu_fu_727                    |relu                   |    0|    0|    1|    1| function |
        |StgValue_114_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_115_myproject_entry490_fu_848  |myproject_entry490     |    0|    0|    0|    0|   none   |
        |StgValue_572_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+

ReuseFactor: 4

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   69|   69|    4|    4| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_3_fu_508        |compute_layer_0_0_0_3  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_s_fu_539        |compute_layer_0_0_0_s  |    7|    7|    4|    4| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_587        |compute_layer_0_0_0_1  |    7|    7|    4|    4| function |
        |call_ret2_relu_fu_623                   |relu                   |    0|    0|    1|    1| function |
        |call_ret4_relu_2_fu_691                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret_relu_1_fu_727                  |relu_1                 |    0|    0|    1|    1| function |
        |StgValue_125_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_126_myproject_entry505_fu_848  |myproject_entry505     |    0|    0|    0|    0|   none   |
        |StgValue_593_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
@benjaminkreis (Member) commented Nov 17, 2017:

This is expected. With <16,6> fixed-point precision, I found that it adds 30 clock cycles of latency. With <32,8>, it adds 60 clock cycles. Division is slow :(
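
For anyone landing here later, a plain-C++ sketch (my illustration, not the hls4ml implementation) of where that latency comes from: softmax needs one exp per input, a shared sum, and then one divide per output. In hardware the exp is usually a lookup table, so the fixed-point division of each exponential by the sum dominates, and a wider ap_fixed type means a longer divider, hence 30 vs 60 extra clocks.

```cpp
#include <cmath>

// Sketch only: N-way softmax. In an HLS fixed-point implementation the
// std::exp calls become table lookups, and out[i] = exp_vals[i] / sum is
// the division whose latency grows with the word width (<16,6> vs <32,8>).
template <int N>
void softmax_sketch(const float in[N], float out[N]) {
    float exp_vals[N];
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {   // exp + accumulate: cheap in hardware
        exp_vals[i] = std::exp(in[i]);
        sum += exp_vals[i];
    }
    for (int i = 0; i < N; ++i) {   // per-output divide: the slow part
        out[i] = exp_vals[i] / sum;
    }
}
```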

@benjaminkreis (Member) commented:

I think we can close this for now. If we want to do something similar without actually computing the full softmax, @ejk43 had some ideas (e.g., finding the maximum).
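
A sketch of that division-free idea (a hypothetical helper, not code from the repo): exp is monotonic and the softmax denominator is shared by all outputs, so softmax preserves the ordering of its inputs. When only the predicted class is needed, an argmax over the pre-softmax values gives the same answer with no exp and no division.

```cpp
// Hypothetical replacement when only the winning class index is needed:
// the largest pre-activation already identifies the softmax maximum.
template <int N>
int argmax_sketch(const float in[N]) {
    int best = 0;
    for (int i = 1; i < N; ++i) {  // comparisons only: no exp, no divide
        if (in[i] > in[best]) best = i;
    }
    return best;
}
```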

GiuseppeDiGuglielmo pushed a commit that referenced this issue on Oct 13, 2023: "Branch to test non-streaming relu activation function"