Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Winograd matrices computation. #3553

Merged
merged 1 commit into from Jul 25, 2019
Merged

Add Winograd matrices computation. #3553

merged 1 commit into from Jul 25, 2019

Conversation

cbalint13
Copy link
Contributor

@cbalint13 cbalint13 commented Jul 16, 2019

This patch adds Winograd (Cook-Toom) matrices computation routines as described by A. Lavin & S. Gray based on original wincnn implementation of Andrew Lavin. Also removes actual hard coded redundant matrices.

It opens the door to:

  • Other kernel_sizes than (3x3), also more tile_sizes (1 < any < 9)
  • Allow adjusting interpolation points for better precision see discussion 1
  • Allow custom range scaling for better precision see discussion 1 and 2
  • Scaling ranges for quantized ops (int8), see: Efficient Winograd Convolution via Integer Arithmetic

Once this PR pass, will attempt with more winograd enhancements.

@cbalint13
Copy link
Contributor Author

cbalint13 commented Jul 17, 2019

@merrymercy @ZihengJiang @masahi @Laurawly ,
Please help with the review.

@@ -333,54 +332,11 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still check kernel size=3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, HPAD / WPAD shouldn't be checked too. Consider, we have no padding.

Copy link
Contributor Author

@cbalint13 cbalint13 Jul 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinthesun , @FrozenGene ,

  • HPAD,WPAD asserts removed but assert KW == KH instead (in all places)
  • {KW, KH} == {3,3} assert removed but assert (kernel_size > 1) (in winograd_utils.py)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the assertion regrading to whether winograd is valid to the same place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinthesun ,

  1. In one/same we guard winograd matrix related (in)validity, already done here
    Taking it out from there means to maintain it in multiple unrelated places (each conv2d files).

  2. For other (in)valid things (padding, stride, dilation etc) pure conv related things is good where they are already now (separate dedicated conv2d.py files for targeted arches: arm_cpu, mali, cuda, etc).

I would keep it simple with this PR, intention for now is to only replace the winograd matrices, let's not touch functionalities or other actual limitations too much.

I'll keep in mind the suggestions here, next PR will be about enhancement around cfg.space (especialy arm/mali), then will be good time to revisit limitations.

@tqchen
Copy link
Member

tqchen commented Jul 17, 2019

cc @ajtulloch

@ajtulloch
Copy link
Contributor

The specific choices of the interpolation points substantially changes the approximations - have you validated that the proposed routines don't affect accuracy vs the tuned ones we currently have? cc @zlateski who was looking at this stuff recently.

Copy link

@zlateski zlateski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some hard-coding will be unavoidable :)

topi/python/topi/nn/winograd_util.py Show resolved Hide resolved
@cbalint13
Copy link
Contributor Author

cbalint13 commented Jul 18, 2019

The specific choices of the interpolation points substantially changes the approximations - have you validated that the proposed routines don't affect accuracy vs the tuned ones we currently have? cc @zlateski who was looking at this stuff recently.

@ajtulloch ,

Some clarifications:

SHORT: We replace actual hardcoded with same ones but this time are generated.

LONG:

  • Don't know from where original hard coded ones comes exactly (folks, please inform me), but for booth of them with actual tile_size = {2, 4} we replace with the very same except few irrelevant sign inversions.

  • Also, for these two hardcoded ones, used points to re-generate them are clear {0, 1, -1, 2}, i believe they are used de facto by all implementations out there (please correct me if I am wrong).

  • For tile_size=6 (i've seen in your PR, don't know either from where it come, the generated ones contains less smaller fractions, (interpolation points suggested by L. Andrew in one discussion, mentioned in header of PR).

  • For rest of tile_size > 6, I follow practice from wincnn as mentioned earlier, however for such big tile values i think there is not much room to improve precision.

  • I left the so called anchor_points visible in the code, they are from best practice, but maybe in future they can be tuned smartly (lets leave for future tasks).

FUTURE: There will be more fun when we want winograd over int8, or even sparse-winograd with int8.

I try to invite @andravin (not sure he will be aware) to see it's opinion on choosing our anchor points (maybe even auto-tune them in future).

Looking forward for more suggestions.

@andravin
Copy link

Hi @cbalint13, the error analysis in the paper that @zlateski pointed to looks pretty rigorous, I would start there.

I am a little bit curious what you plan to use the large tiles for. Today's devices have high compute / memory bandwidth ratios and low numeric precision, neither of which is conducive to a large tile size. I think the large tiles only payoff when you have an excess of DRAM memory bandwidth, or a large fast cache.

@cbalint13
Copy link
Contributor Author

Hi @cbalint13, the error analysis in the paper that @zlateski pointed to looks pretty rigorous, I would start there.

I am a little bit curious what you plan to use the large tiles for. Today's devices have high compute / memory bandwidth ratios and low numeric precision, neither of which is conducive to a large tile size. I think the large tiles only payoff when you have an excess of DRAM memory bandwidth, or a large fast cache.

Hi @andravin ,

Thanks for joining the thread !

  • As short intro TVM is about optimizing routines (not only NN related) to various targets (and there are a lot of devices out there) from high/low end GPU to small CPU or even FPGAs, by exploiting combinatoric space of the operators itself. Of course TVM is even more than just optimizing, it holds IR language for compiling/doing such and lots of cool other stuff.

  • I try propose here a replacement for some hardcoded winograd matrices (the classical tile_size 2 & 4 for conv2d case) but also leave room for any size up to 8.

  • In practice TVM's autotuning (mostly based on xgboost over collected compute statistics) will find tile size of 2, 3, 4 or rare case even 6 (conducted lots of test on opencl/mali-gpu in winograd scenario), haven't seen to pick up higher values.

  • I know it may sound absurd > 6 but that's the cool part of TVM it will find the proper way to do it (this is the goal to find it auto-magically).

  • As far i know TVM project is unique among libraries it can "autotune" and try find best computational approach for a specific operator over a specific hardware target, it quite does well even if combinatoric space covers millions of possibilities (folks here can tell lot more about)

Most interesting part would be targeting int8 or other quantization variants, so i think by having the matrix generator (maybe some post "precision tuner") would be good allies for winograd class schedules in TVM fashion of doing it.

I would remind you all of some more exotic approach that even uses complex numbers as interpolation points here. Even such approach can be accounted in future.

(apologies for possible redundancy and length of the info)

@cbalint13
Copy link
Contributor Author

For no confusions, to sum up what we have in this PR up to this point:

  1. We replace two hardcoded winograd matrices with generated same ones.
  2. We still have only tile_size = {2, 4} allowed, and this PR will not change this.
  3. We now allow other kernels than {3x3} where K >= 2.

Plus, some doors are open from here on.

@ajtulloch
Copy link
Contributor

This looks good to me - I forgot that this is only for m \in {2, 4}.

One thing I'd personally like is to add a test that verifies the explicit matrix values for m == 2/4, so they're explicitly specified somewhere?

BTW, for the m=6 case, I got them from NNPACK via Maratyszcza/NNPACK#8.

@andravin
Copy link

andravin commented Jul 19, 2019

@cbalint13 and @ajtulloch you might want to try the points that Barabasz et al. say give the best accuracy:

  • 4x4 output: { 0, -1, 1, 1/2, -2 } *
  • 6x6 output: {0, -1, 1, 1/2, -1/2, 2, -2 }
  • 4x1 or 1x4 output (ie 1-D convolution): {0, -1, 1, 1/2, -3} *

Choices that are a bit unusual are marked with *
Paper linked by, @zlateski, above.

I remember searching for F(4x4, 3x3) output points that give the best accuracy a while ago, and did see improvements with non-obvious choices, but the differences were not huge: NervanaSystems/neon#224

Not sure I agree though with how the paper modeled the input and filters with a uniform distribution. In practice, input data is gaussian or sub-gaussian and highly correlated. Anyway this observation made me dubious about modifying algorithms to make them on average more accurate with a random distribution that might not reflect the data that they will be used with in practice. I think a proper error analysis would be measured in Resnet-34 or Resnet-50 using actual weights and real images.

Also here are some optimizations that are often overlooked.

  1. It is possible to use different algorithms for the horizontal and vertical dimensions of a 2D convolution. So you could nest F(4,3) with F(2,3) for example (call it F(4x2, 3x3)), or even F(6,3) with direct convolution, F(6x1, 3x3).

  2. Loop partitioning and reordering can make a big difference in the efficiency of any dense computation, and Winograd/Cook/Toom conv2d is no different. Look to fit a super-tile of transform -> batch matrix multiply -> inverse transform in the innermost loops, so that the transform data stays in the fastest memory (eg shared memory on GPU, L1/L2 cache on CPU). This is important, because the fast convolution algorithms decrease computational density by simultaneously expanding the input data and reducing the number of arithmetic operations.

  3. In order to make the inner loop of 2) efficient, you probably need to use a small tile algorithm for the inner loop, otherwise computational density will be too low and you will be memory bound. Consider using 1) to nest a 1D fast algorithm with a 1D direct convolution (1D tiles are the smallest!), or move a large tile 1D algorithm to the outer loop (so that it goes to DRAM on GPU or L3 cache on CPU), and do a small tile 1D algorithm in the inner loop.

Sorry this is so long. I did not have the time to make it shorter. ;-)

@tqchen
Copy link
Member

tqchen commented Jul 23, 2019

Thanks for the great discussions. @cbalint13 it would be great if we can push forward to act on this PR, by summarizing the discussions, and propose list of changes. Then we can get everyone to review and approve

@cbalint13 cbalint13 changed the title Add Winograd matrices computation. [WIP] Add Winograd matrices computation. Jul 23, 2019
@tqchen
Copy link
Member

tqchen commented Jul 24, 2019

@cbalint13 please request reviews by tagging people(likely those who are in this conversation) when it is ready

@cbalint13
Copy link
Contributor Author

cbalint13 commented Jul 24, 2019

@ajtulloch , @zlateski ,

Done tests using old matrix set versus new proposed schema:

  • all _2_4 tests confirm that old F(2,3), F(4,3) matrices are less accurate than new.
  • all _2_4_6 tests confirm F(6,3) one by @ajtulloch (NNPACK) is also less accurate than new one.
  • used table provided by B. Barabasz et al., cited in the code, anyone can update in future.
  • a shallow test that check to the 1e-5 limit already exist.

@merrymercy , @kevinthesun , @FrozenGene , @tqchen ,

To sum up:

  1. we replace hardcoded matrix schema (the old 2,4) with a flexible computational one.
  2. interpolation points provenance are cited, they can be overrided (if one find better)
  3. we still have only tile_size = {2, 4} allowed and this PR will not change this.
  4. we now allow other kernels than {3x3} where K >= 2 (works for cuda not arm/mali for now).

Immediate future step (not intented for this PR):

  • get rid of such static things (add a tile_size knob for autotunig)
  • try integrate/finish @ajtulloch NCHWc push request (in respect with actual API).
  • fix conv2d for kernel_size > 3 on arm & mali (they fail now, on cuda works).
  • enhance topi winograd test to support more cases.

@cbalint13 cbalint13 changed the title [WIP] Add Winograd matrices computation. Add Winograd matrices computation. Jul 24, 2019
Copy link
Contributor

@ajtulloch ajtulloch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link

@zlateski zlateski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment, otherwise lgtm.


return np.array(in_pts[degree-1], dtype=np.float64)

def winograd_transform_matrices(tile_size, kernel_size, out_dtype):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that familiar with the whole codebase, so the following comment might be irrelevant.

This function performs non-trivial compute, which should probably be memoized on some level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zlateski ,
Not sure how to do the memoization. We have a @memoize decorator, it is used only in test scripts so far and i see it can memoize only functions with no arguments on them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zlateski,
Also, if use @memoize can't see too much yield , (it's a pickle file at all) i think the overhead of IO would be greater for such small data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zlateski Very good point! Actually, memoize only write to file when the program exists. Follow up on #3687

@cbalint13
Copy link
Contributor Author

@zlateski ,

I think some hard-coding will be unavoidable :)

At least, from multiple boxed rectangular matrices we ended up in a nice single triangle lookup table :)

@tqchen
Copy link
Member

tqchen commented Jul 25, 2019

@@ -330,57 +329,14 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)

assert layout == 'NCHW'
assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
assert KH == KW and HSTR == 1 and WSTR == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test case for this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other parts lgtm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinthesun ,

  • Added 5x5, 7x7 testcases (cuda only)
  • Constrainted arm/mali to 3x3 only (until fix it in other PR)

@merrymercy merrymercy merged commit 97e333c into apache:master Jul 25, 2019
@cbalint13
Copy link
Contributor Author

cbalint13 commented Jul 26, 2019

@ajtulloch
@andravin
@FrozenGene
@kevinthesun
@merrymercy
@tqchen
@zlateski

Thank you all for the help !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants