CpuMath Enhancement: Preamble for hardware intrinsics implementation

Style changes needed to solve part of https://github.com/dotnet/machinelearning/issues/823

## Details
- Add an alignment "preamble" to the SSE/AVX intrinsics implementations in `src\Microsoft.ML.CpuMath\SseIntrinsics.cs` and `src\Microsoft.ML.CpuMath\AvxIntrinsics.cs`:

[Preamble](https://github.com/dotnet/coreclr/blob/b896dd14830b600043a99c2626ea848ad679fb4f/src/System.Private.CoreLib/shared/System/SpanHelpers.Char.cs#L96-L102): 
1. `while (!aligned) { do scalar operation; }` // preamble
2. Do the vectorized operation using Read**Aligned**
3. `while (!end) { do scalar operation; }`

For large arrays, especially those that cross cache line or page boundaries, doing this should save a measurable amount of time.
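
The three steps above might look roughly like the following for an in-place "add a scalar" kernel. This is only a sketch under stated assumptions, not the CpuMath implementation: the class name `AlignedPreambleSketch` and the method `AddScalarAligned` are hypothetical, it covers only the SSE (16-byte) case, and it assumes a runtime that exposes `System.Runtime.Intrinsics.X86.Sse` (.NET Core 3.0 or later) with unsafe code enabled in the project.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AlignedPreambleSketch
{
    // Hypothetical helper (not part of CpuMath): dst[i] += scalar for every i.
    public static unsafe void AddScalarAligned(float scalar, Span<float> dst)
    {
        fixed (float* pDst = dst)
        {
            float* p = pDst;
            float* pEnd = pDst + dst.Length;

            if (Sse.IsSupported)
            {
                // 1. Preamble: scalar steps until p sits on a 16-byte boundary.
                //    If the boundary is never reached, p runs to pEnd and the
                //    vector loop below is skipped.
                while (p < pEnd && ((ulong)p & 15) != 0)
                {
                    *p += scalar;
                    p++;
                }

                // 2. Vectorized body: p is 16-byte aligned here, so aligned
                //    loads and stores are safe. Four floats per iteration.
                Vector128<float> s = Vector128.Create(scalar);
                while (pEnd - p >= 4)
                {
                    Sse.StoreAligned(p, Sse.Add(Sse.LoadAlignedVector128(p), s));
                    p += 4;
                }
            }

            // 3. Tail: scalar steps for whatever is left (at most three
            //    elements after the vector loop, or everything without SSE).
            while (p < pEnd)
            {
                *p += scalar;
                p++;
            }
        }
    }

    public static void Main()
    {
        float[] data = new float[12];
        Array.Fill(data, 1f);

        // Slicing at offset 1 shifts the start address by 4 bytes, which
        // makes it likely that the preamble has work to do.
        AddScalarAligned(2.5f, data.AsSpan(1));

        bool ok = data[0] == 1f;
        for (int i = 1; i < data.Length; i++)
        {
            ok &= data[i] == 3.5f;
        }
        Console.WriteLine(ok);
    }
}
```

An AVX variant would presumably follow the same shape with a 32-byte alignment mask and eight floats per iteration. Kernels that read from a second buffer need extra care, because aligning `dst` does not guarantee that `src` is aligned as well.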

References:
- https://github.com/dotnet/machinelearning/pull/562/files/f0f81a5019a3c8cbd795a970e40d633e9e1770c1#r204061074
- https://github.com/dotnet/machinelearning/pull/1143

Currently these functions use only unaligned loads; we can make them faster by aligning the data and using aligned loads.

- [ ] AddScalarU
- [ ] ScaleSrcU
- [ ] AddScaleU
- [ ] ScaleAddU
- [ ] AddU
- [ ] AddScaleCopyU
- [ ] AddSU
- [ ] MulElementWiseU
- [ ] SumU
- [ ] SumSqU
- [ ] SumSqDiffU
- [ ] SumAbsU
- [ ] SumAbsDiffU
- [ ] MaxAbsU
- [ ] MaxAbsDiffU
- [ ] DotU
- [ ] DotSU
- [ ] Dist2
- [ ] SdcaL1UpdateU
- [ ] SdcaL1UpdateSU

