Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SrCnn entire API by implementing function #5135

Merged
merged 12 commits into from May 21, 2020

Conversation

mengaims
Copy link
Contributor

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There's associated issues. All PR's should have issue(s) associated - unless a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

klausmh and others added 4 commits May 15, 2020 11:57
* POC Batch transform

* SrCnn batch interface

* Removed comment

* Handled some APIreview comments.

* Handled other review comments.

* Resolved review comments. Added sample.

Co-authored-by: Yael Dekel <yaeld@microsoft.com>
if (batchSize == -1)
{
_batchSize = (int)input.GetRowCount();
}
Copy link
Contributor Author

@mengaims mengaims May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the transformer version, we allow batchSize to be -1, means using the whole data set as one batch. Here I try to get row count from input when batchSize=-1 and set an actual positive batchSize. But I found that if we load dataview by text file loader, we could not get this count by this way. Is there any other way to achieve this? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML.NET does lazy evaluation of most things. So the textloader may not know the number of rows ahead of time. When batchSize is -1, you will be computing the anomaly on the whole file and so you are going to read the whole file before doing your computation. You will need to pass the -1 value of _batchSize further and calculate the actual row count after you have read all the data and before computation begins.


In reply to: 425803839 [](ancestors = 425803839)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I commented the same thing above before I noticed your comment.
If you need the number of rows in the data, you can compute it here in the ctor, by getting a row cursor with no active columns and counting how many times MoveNext() returns true. So you can do something like this:

var rowCount = input.GetRowCount();
if (rowCount!=null)
    _batchSize = rowCount;
else
{
    _batchSize = 0;
    using (var cursor = input.GetRowCursor())
    {
        while (cursor.MoveNext())
            _batchSize++;
    }
}

In reply to: 426029554 [](ancestors = 426029554,425803839)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, you would need to check for overflow in this case, since the row count can be greater than int.MaxValue.


In reply to: 426407146 [](ancestors = 426407146,426029554,425803839)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concern about count batchSize here since this may introduce an extra traversing. Could we allow batchSize=-1 when initialize Batch object, and in GetIsNewBatchDelegate and GetLastInBatchDelegate, add a special condition for _batchSize=1? In Batch class, since data will be read in later, there is no need to get batchSize from outside when no batch is actually used.


In reply to: 426407783 [](ancestors = 426407783,426407146,426029554,425803839)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, please check.


In reply to: 426719317 [](ancestors = 426719317,426407783,426407146,426029554,425803839)

/// </format>
/// </example>
public static IDataView BatchDetectAnomalyBySrCnn(this DataOperationsCatalog catalog, IDataView input, string outputColumnName, string inputColumnName,
double threshold = 0.3, int batchSize = 1024, double sensitivity = 99, SrCnnDetectMode detectMode = SrCnnDetectMode.AnomalyOnly)
Copy link
Contributor

@harishsk harishsk May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be in the TimeSeries catalog? #Resolved

Copy link
Contributor

@klausmh klausmh May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the suggestion from @yaeldekel to put it here #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaeldekel, Can you please elaborate? We already have a TimeSeries catalog. Shouldn't it belong there?


In reply to: 426065077 [](ancestors = 426065077)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous iterations it was in the TransformsCatalog which is why I suggested moving it. TimeSeriesCatalog would also work.


In reply to: 426078156 [](ancestors = 426078156,426065077)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that RCA used a newly defined AnomalyDetectionCatalog, maybe it would be better to align with it?


In reply to: 426439074 [](ancestors = 426439074,426078156,426065077)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Please move it to the AnomalyDetectionCatalog


In reply to: 427140758 [](ancestors = 427140758,426439074,426078156,426065077)

private readonly string _inputColumnName;
private readonly int _outputLength;
private readonly SrCnnEntireModeler _modler;

Copy link
Contributor

@harishsk harishsk May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: _modeler #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426020483 [](ancestors = 426020483)

{
_outputLength = AnomalyOnlyOutputLength;
_modler = new SrCnnEntireModeler(threshold, sensitivity, detectMode, _outputLength);
}
Copy link
Contributor

@harishsk harishsk May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This line is common to all the if blocks and can be moved out of the if blocks. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 426031408 [](ancestors = 426031408)

private const int AnomalyOnlyOutputLength = 3;
private const int AnomalyAndExpectedValueOutputLength = 4;
private const int AnomalyAndMarginOutputLength = 7;

Copy link
Contributor

@harishsk harishsk May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: You can put this in an array to match the index of SrCnnDetectMode and reference the array when initializing the value of _outputLength in the constructor #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426031904 [](ancestors = 426031904)

}

public SrCnnBatchAnomalyDetector(IHostEnvironment env, IDataView input, string inputColumnName, string outputColumnName, double threshold, int batchSize, double sensitivity, SrCnnDetectMode detectMode)
: base(env, "SrCnnBatchAnomalyDetector", input)
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SrCnnBatchAnomalyDetector [](start = 25, length = 25)

Please use nameof(SrCnnBatchAnomalyDetector). #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed,


In reply to: 426396967 [](ancestors = 426396967)

public DataViewSchema Schema => SchemaBindings.AsSchema;

private readonly IDataView _source;
private readonly IHost _host;
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_host [](start = 31, length = 5)

Make this protected, and use in the deriving class. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426397354 [](ancestors = 426397354)

public SrCnnBatchAnomalyDetector(IHostEnvironment env, IDataView input, string inputColumnName, string outputColumnName, double threshold, int batchSize, double sensitivity, SrCnnDetectMode detectMode)
: base(env, "SrCnnBatchAnomalyDetector", input)
{
Contracts.CheckValue(env, nameof(env));
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckValue [](start = 22, length = 10)

This check should be done in the base class. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426397534 [](ancestors = 426397534)

{
Contracts.CheckValue(env, nameof(env));

Contracts.CheckValue(inputColumnName, nameof(inputColumnName));
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts [](start = 12, length = 9)

The IHost in the base class should be protected, and it should be used here. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426397727 [](ancestors = 426397727)

Contracts.CheckValue(inputColumnName, nameof(inputColumnName));
_inputColumnName = inputColumnName;

env.CheckUserArg(batchSize == -1 || batchSize >= MinBatchSize, nameof(batchSize), "BatchSize must be -1 or no less than 12.");
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

env [](start = 12, length = 3)

Use IHost from base class. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426397902 [](ancestors = 426397902)

env.CheckUserArg(batchSize == -1 || batchSize >= MinBatchSize, nameof(batchSize), "BatchSize must be -1 or no less than 12.");
if (batchSize == -1)
{
_batchSize = (int)input.GetRowCount();
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetRowCount [](start = 40, length = 11)

That will most likely not work, since most implementations of IDataView return null for this method. #Resolved

|| detectMode == SrCnnDetectMode.AnomalyAndExpectedValue
|| detectMode == SrCnnDetectMode.AnomalyAndMargin, nameof(detectMode), "Invalid detectMode");

if (detectMode.Equals(SrCnnDetectMode.AnomalyOnly))
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnomalyOnly [](start = 50, length = 11)

Since this is an enum, this can be done using a switch statement. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, you don't need an if or a switch statement if you store the possible values for _outputLength in an const array whose indices match the the enum values.


In reply to: 426408554 [](ancestors = 426408554)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 426828074 [](ancestors = 426828074,426408554)


protected override Batch InitializeBatch(DataViewRowCursor input) => new Batch(_batchSize, _outputLength, _modler);

protected override Func<bool> GetIsNewBatchDelegate(DataViewRowCursor input)
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetIsNewBatchDelegate [](start = 38, length = 21)

I think this method and GetLastInBatchDelegate can be implemented in the base class, I can't imagine a future implementation of the base class that would use a different way of computing this. @harishsk, what do you think? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a check for _batchSize==-1 is added here, I think it's proper to keep it in inherit class


In reply to: 426409875 [](ancestors = 426409875)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding -1 as an option meaning "the entire data view" is reasonable for the base class, but I won't insist on it if you think that it is something specific to this implementation of the base class.
/cc @harishsk


In reply to: 427135642 [](ancestors = 427135642,426409875)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using -1 as the "entire data view" is reasonable. But if @mengaims feels it is proper to keep it in the inherited class, I am okay with it since we only have one derived class using it now. If we get another inherited class, we can revisit this then and refactor if necessary. Since this is an internal implementation detail, there shouldn't be any issues.


In reply to: 427278034 [](ancestors = 427278034,427135642,426409875)


protected override Func<int, bool> GetSchemaBindingDependencies(Func<int, bool> predicate)
{
return (SchemaBindings as Bindings).GetDependencies(predicate);
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchemaBindings [](start = 20, length = 14)

Suggestion: Add a private field of type Bindings:

private readonly Bindings _bindings;

and then the override of SchemaBindings would be:

public override ColumnBindingsBase SchemaBindings => _bindings;

and this method would return _bindings.GetDependencies(predicate). #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 426410915 [](ancestors = 426410915)

private readonly int _batchSize;
private readonly int _outputLength;
private SrCnnEntireModeler _modler;
private double[][] _results;
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_results [](start = 31, length = 8)

Why not make this readonly? That way you don't have to allocate a new array for every batch. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the scenario that bachSize=-1, and we do not know the exact row count until we read in data, results could not be allocated when initialize Batch.


In reply to: 426411747 [](ancestors = 426411747)

}
}

public sealed class SrCnnEntireModeler
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

Can this be private? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Ideally this should be private


In reply to: 426413214 [](ancestors = 426413214)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is similar as Batch class, used by protect methods as parameter in SrCnnBatchAnomalyDetector, if make them private, the accessasiblity will not be aligned and there will be compile errors. I've made them internal.


In reply to: 426828580 [](ancestors = 426828580,426413214)

_threshold = threshold;
_sensitivity = sensitivity;
_detectMode = detectMode;
_outputLength = outputLength;
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_outputLength [](start = 16, length = 13)

_outputLength is determined by _detectMode, so you can either use the array in the wrapping class like Harish suggested, or check that _outputLength and _detectMode are consistent. (actually, if you make this a private class, then an assert instead of a check would be enough). #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 426418482 [](ancestors = 426418482)


public double[][] Train(double[] values)
{
double[][] results = new double[values.Length][];
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

results [](start = 27, length = 7)

Pass in an array so you don't have to allocate a new one on every call. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason explained above.


In reply to: 426418976 [](ancestors = 426418976)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, then should it be a ref input to the method? that way you allocate a new array only if it's null.
There are two cases: when batchSize=-1, and when batchSize>0. In the first case, you allocate only once anyway since you call this method only after the entire pass over the data. In the second case, you would allocate new arrays for each batch, wouldn't it improve perf to only allocate once?


In reply to: 427138599 [](ancestors = 427138599,426418976)

Copy link
Contributor

@klausmh klausmh May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed #Resolved

return results;
}

private static void SpecturalResidual(double[] values, double[][] results, double threshold)
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpecturalResidual [](start = 32, length = 17)

Should this be SpectralResidual? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 426419496 [](ancestors = 426419496)


// Step 2: FFT transformation
int length = backAddList.Length;
double[] fftRe = new double[length];
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fftRe [](start = 25, length = 5)

Is there a way to make these fields of the class so they are only allocated once? #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this class contain buffers for all these computations to avoid multiple allocations of arrays. This would mean that the SrCnnBatchAnomalyDetector cannot have an SrcEntireModeler as a field, and the InitializeBatch method would have to instantiate it (IDataVews should be thread safe, since it should be possible to create multiple row cursors on multiple threads from an IDataView object).


In reply to: 426420411 [](ancestors = 426420411)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is the same as results array, due to the batchSizxe=-1 scenario, we will not know the length of these arrays until we actually read in data.


In reply to: 426431891 [](ancestors = 426431891,426420411)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating a bunch of arrays for each batch might have a perf price, with the GC going into overdrive. Consider allocating them only once and re-using them for subsequent batches.


In reply to: 427139611 [](ancestors = 427139611,426431891,426420411)

Copy link
Contributor

@klausmh klausmh May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If every instance of SrEntireModeler contained preallocated buffers, would the result not be the same (since InitializeBatch would then have to allocate a new instance of SrEntireModeler)? #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because InitializeBatch is called only once, when the row cursor is created (see line 100 in BatchDataViewMapperBase.cs). The SrEntireModeler methods (Train and the helper methods that are called from it) are called after every batch is read from the input cursor (see line 168 in BatchDataViewMapperBase.cs).


In reply to: 427614415 [](ancestors = 427614415)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps InitializeBatch should be renamed CreateBatch to avoid confusion?


In reply to: 427764070 [](ancestors = 427764070,427614415)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


In reply to: 427764306 [](ancestors = 427764306,427764070,427614415)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 427285088 [](ancestors = 427285088,427139611,426431891,426420411)

@yaeldekel yaeldekel self-requested a review May 18, 2020 07:35
return cumSumList;
}

private static double CalculateSocre(double mag, double avgMag)
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soc [](start = 43, length = 3)

Typo. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 426435676 [](ancestors = 426435676)

using System.Linq;
using Microsoft.ML.Data;
using Microsoft.ML.Data.DataView;
using Microsoft.ML.Numeric;
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numeric [](start = 19, length = 7)

Is this used? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.


In reply to: 426436746 [](ancestors = 426436746)

/// ]]>
/// </format>
/// </example>
public static IDataView BatchDetectAnomalyBySrCnn(this DataOperationsCatalog catalog, IDataView input, string outputColumnName, string inputColumnName,
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BatchDetectAnomalyBySrCnn [](start = 32, length = 25)

Can this be renamed DetectAnomaliesBy...? #Resolved

IDataView dataView;
if (loadDataFromFile)
{
var dataPath = GetDataPath(Path.Combine("Timeseries", "anomaly_detection.csv"));
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path.Combine( [](start = 43, length = 13)

I think there is an overload of GetDataPath which does Path.Combine for you. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


In reply to: 426440599 [](ancestors = 426440599)

{
var ml = new MLContext(1);
IDataView dataView;
if (loadDataFromFile)
Copy link

@yaeldekel yaeldekel May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadDataFromFile [](start = 16, length = 16)

Why would there be a difference? Does that test different functionality of the new API? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required by Harish to test both loading dataview from file and from List


In reply to: 426441364 [](ancestors = 426441364)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of the earlier editions of the API (when it used some non-primitive data types) I thought I saw some issues with whether the API would support loading from files cleanly. So I had asked for this test. It may not be required now. But if it is here already, it is is okay to leave it as is.


In reply to: 426707236 [](ancestors = 426707236,426441364)

else if (results.Length > values.Length)
{
Array.Resize<double[]>(ref results, values.Length);
}
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to do this? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last batch may have different size. We only borrows data when the last batch has less than 12 points, when it has more than 12, we will use the length as it is.


In reply to: 427851487 [](ancestors = 427851487)


private double[] BackAdd(double[] data)
{
AllocateDoubleArray(ref _predictArray, _lookaheadWindowSize + 1);
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_predictArray [](start = 40, length = 13)

This is always allocated to be the same size, and doesn't depend on the batch size. Can be readonly and instantiated in the constructor. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


In reply to: 427853277 [](ancestors = 427853277)

{
arr = new double[length];
}
else if (arr.Length != length)
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arr.Length != length [](start = 25, length = 20)

Similar question - why do you need this?
Is this because of the last batch that may be smaller than the rest? Can't you just ignore the last few elements in the arrays instead of resizing? (resizing allocates a new array). #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained in email


In reply to: 427855107 [](ancestors = 427855107)

private void SpectralResidual(double[] values, double[][] results, double threshold)
{
// Step 1: Get backadd wave
double[] backAddList = BackAdd(values);
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BackAdd [](start = 39, length = 7)

I think this puts the values in _backAddArray, it doesn't need to return it. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this and other similar places.


In reply to: 427856187 [](ancestors = 427856187)

AllocateDoubleArray(ref _fftRe, length);
AllocateDoubleArray(ref _fftIm, length);

FftUtils.ComputeForwardFft(backAddList, Enumerable.Repeat((double)0.0f, length).ToArray(), _fftRe, _fftIm, length);
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerable.Repeat((double)0.0f, length).ToArray() [](start = 56, length = 49)

Can you store a buffer of zeros as a field? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


In reply to: 427856973 [](ancestors = 427856973)

AllocateDoubleArray(ref _fftRe, length);
AllocateDoubleArray(ref _fftIm, length);

FftUtils.ComputeForwardFft(backAddList, Enumerable.Repeat((double)0.0f, length).ToArray(), _fftRe, _fftIm, length);
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ComputeForwardFft [](start = 25, length = 17)

I think you might be able to change this method to take in Span<double> instead of arrays, this way you won't have to resize, you can just take a subspan of the arrays, and the arrays will always be the same length. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained in email


In reply to: 427857936 [](ancestors = 427857936)

_previousBatch = _previousBatch.GetRange(_batch.Count, bLen);
_previousBatch.AddRange(_batch);
_modeler.Train(_previousBatch.ToArray(), ref _results);
_results = _results.Skip(bLen).ToArray();
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip(bLen).ToArray() [](start = 40, length = 20)

Suggestion: Instead of skipping, keep an int indicating the "beginning" of _results, so the getter would return the value in _results[input.Position % _batchSize + _bLen] instead of _results[input.Position % _batchSize]. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed as you suggested.


In reply to: 427863039 [](ancestors = 427863039)

for (int i = 0; i < _outputLength; ++i)
{
result.Values[i] = _results[input.Position % _batchSize][i];
}
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shorter alternative:

_results[...].CopyTo(result.Values)
``` #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed as you suggested.


In reply to: 427864940 [](ancestors = 427864940)

private void GetExpectedValue(double[] values, double[][] results)
{
//Step 8: Calculate Expected Value
var exps = CalculateExpectedValueByFft(GetDeanomalyData(values, GetAnomalyIndex(results.Select(x => x[1]).ToArray())));
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalculateExpectedValueByFft [](start = 27, length = 27)

Can return void, values are stored in a field. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 427865834 [](ancestors = 427865834)

private void GetMargin(double[] values, double[][] results, double sensitivity)
{
//Step 8: Calculate Expected Value
var exps = CalculateExpectedValueByFft(GetDeanomalyData(values, GetAnomalyIndex(results.Select(x => x[1]).ToArray())));
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalculateExpectedValueByFft [](start = 27, length = 27)

Return void. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this and other similar places.


In reply to: 427867351 [](ancestors = 427867351)

var exps = CalculateExpectedValueByFft(GetDeanomalyData(values, GetAnomalyIndex(results.Select(x => x[1]).ToArray())));

//Step 9: Calculate Boundary Unit
var units = CalculateBoundaryUnit(values, results.Select(x => x[0] > 0).ToArray());
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalculateBoundaryUnit [](start = 28, length = 21)

Same. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 427867423 [](ancestors = 427867423)

var units = CalculateBoundaryUnit(values, results.Select(x => x[0] > 0).ToArray());

//Step 10: Calculate UpperBound and LowerBound
var margins = units.Select(x => CalculateMargin(x, sensitivity)).ToList();
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalculateMargin [](start = 48, length = 15)

I would do this inside the for loop:

for (int i = 0; i < results.Length; ++i)
                {
                    var margin = CalculateMargin(units[i], sensitivity);
                    results[i][3] = exps[i];
                    results[i][4] = units[i];
                    results[i][5] = exps[i] + margin;
                    results[i][6] = exps[i] - margin;
                    //Step 11: Update Anomaly Score
                    results[i][1] = CalculateAnomalyScore(values[i], exps[i], units[i], results[i][0] > 0);
                }
``` #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


In reply to: 427869378 [](ancestors = 427869378)


private double[] GetDeanomalyData(double[] data, int[] anomalyIdxList)
{
double[] deAnomalyData = (double[])data.Clone();
Copy link

@yaeldekel yaeldekel May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double[] deAnomalyData = (double[])data.Clone(); [](start = 16, length = 48)

Can this be converted to a field similarly to the rest of the arrays this class uses? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 427869960 [](ancestors = 427869960)

@mengaims mengaims marked this pull request as ready for review May 21, 2020 02:11
@mengaims mengaims requested a review from a team as a code owner May 21, 2020 02:11
Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

for (int i = 0; i < arr.Length; ++i)
{
arr[i] = 0.0;
}
Copy link

@yaeldekel yaeldekel May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Values will be 0 by default. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since values are 0 by default, removed this fuction, use AllocateDoubleArray instead, they are equally.


In reply to: 428513049 [](ancestors = 428513049)

if (_batch.Count < MinBatchSize)
{
if (_previousBatch.Count + _batch.Count < MinBatchSize)
return;
Copy link

@yaeldekel yaeldekel May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return [](start = 24, length = 6)

In what scenario does this happen?
What does it mean for the getter, doesn't it need to return something? It seems that if this happens it will just return the "old" values in _results. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the condition to _previousBatch.Count == 0 to make it easier to understand. This happens when batchSize is set to -1 but the input point count < 12. I changed the return to throw Exception here.
if (_previousBatch.Count == 0)
{
throw new InvalidOperationException("The input must contain no less than 12 points.");
}


In reply to: 428515960 [](ancestors = 428515960)

{
if (_previousBatch.Count == 0)
{
throw new InvalidOperationException("The input must contain no less than 12 points.");
Copy link

@yaeldekel yaeldekel May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InvalidOperationException [](start = 34, length = 25)

throw Contracts.Except... #Resolved

Copy link
Contributor

@klausmh klausmh May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed #Resolved


using System;
using System.Collections.Generic;
using System.IO;
Copy link

@yaeldekel yaeldekel May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IO [](start = 13, length = 2)

Is this needed? #Resolved

Copy link
Contributor

@klausmh klausmh May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System.io is not needed. Fixed #Resolved

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@harishsk harishsk changed the title Add SrCnn entire API by implementing funciton Add SrCnn entire API by implementing function May 21, 2020
@harishsk harishsk merged commit a4af0ec into dotnet:master May 21, 2020
/// <param name="outputColumnName">Name of the column resulting from data processing of <paramref name="inputColumnName"/>.
/// The column data is a vector of <see cref="System.Double"/>. The length of this vector varies depending on <paramref name="detectMode"/>.</param>
/// <param name="inputColumnName">Name of column to process. The column data must be <see cref="System.Double"/>.</param>
/// <param name="threshold">The threshold to determine anomaly, score larger than the threshold is considered as anomaly. Must be in [0,1]. Default value is 0.3.</param>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definition of threshold refers to "score," which I don't see defined anywhere

@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants