How to get exact outliers in Univariate Time Series using OutlierDetector? #139

Closed
sthirumoorthi opened this issue Oct 12, 2021 · 9 comments

@sthirumoorthi

Hello, I'm evaluating the outlier detection framework for my project, but the model appears to return a range of timestamps around the outlier rather than the exact index. Below is a sample of my dataset.

2019-01-01 | 35
2019-01-02 | 32
2019-01-03 | 30
2019-01-04 | 31
2019-01-05 | 44
2019-01-06 | 29
2019-01-07 | 45
2019-01-08 | 43
2019-01-09 | 500
2019-01-10 | 27
2019-01-11 | 38
.....
I would expect the model to return "500" as the outlier and "2019-01-09" as its date, but it returns the following:
ts_outDetection.outliers[0] ->
[Timestamp('2019-01-06 00:00:00'),
Timestamp('2019-01-07 00:00:00'),
Timestamp('2019-01-08 00:00:00'),
Timestamp('2019-01-09 00:00:00'),
Timestamp('2019-01-10 00:00:00'),
Timestamp('2019-01-11 00:00:00'),
Timestamp('2019-01-12 00:00:00')]

Can someone help me understand the outlier detector concept in Kats, or point me to the reference documentation (if any)?
Let me know if you need more details.
[Image: FB Kats Issue]

@MoKazemi9
Contributor

Hi @sthirumoorthi, thanks for opening the issue. Our tutorial on outlier detection (section 2, Outlier Detection) should be useful. Please let us know if you have any other questions.
https://github.com/facebookresearch/Kats/blob/main/tutorials/kats_202_detection.ipynb

@sthirumoorthi
Author

Hello, thanks for the quick response. I was already following the outlier detection tutorial at this link. However, the model returns a range of outliers instead of a single outlier value or index.

In my example dataset, I was expecting the model to return only "500" and "73" as outliers (values or indexes), but it returned additional values/indexes.

ts_outDetection.outliers[0] ->
[Timestamp('2019-01-06 00:00:00'),
Timestamp('2019-01-07 00:00:00'),
Timestamp('2019-01-08 00:00:00'),
Timestamp('2019-01-09 00:00:00'),
Timestamp('2019-01-10 00:00:00'),
Timestamp('2019-01-11 00:00:00'),
Timestamp('2019-01-12 00:00:00')]

Could you please check the output of this model and share your comments?

@sthirumoorthi
Author

For further reference, I have included my test dataset (first 20 observations) and highlighted the outliers that were detected by the Kats OutlierDetector model.
[Image: FB Kats dataset]

@MoKazemi9
Contributor

@sthirumoorthi, thanks for sharing your results. Did you convert your data to TimeSeriesData before applying the detector?
from kats.consts import TimeSeriesData

@sthirumoorthi
Author

sthirumoorthi commented Oct 14, 2021

Hi @MoKazemi9, thanks for checking the details. Yes, I did the conversion before applying the detector.

from kats.consts import TimeSeriesData
from kats.detectors.outlier import OutlierDetector

# Transform the data for outlier detection
birth_ts = TimeSeriesData(df_birth)

# Outlier detection model for the 'Daily Total Female Birth' dataset
ts_outDetection = OutlierDetector(birth_ts, 'additive')
ts_outDetection.detector()

My complete Python file, along with the test dataset, is available in my GitHub repository for your reference.
https://github.com/sthirumoorthi/TimeSeries-Models/tree/main/FB%20Kats%20with%20Example

@sanelemahlalela

I think @sthirumoorthi is asking why the detector returns the index (time column) instead of the value (value column) for his dataset. If that is the case, @sthirumoorthi, I think you are better off getting the index back: with the index you can look up the value in your copy of the dataset. If multiple rows have that particular value (500 in this case), you'll receive multiple indexes. This is my understanding; I might be wrong.
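
The index-to-value lookup described above can be sketched with plain pandas. This is only an illustration: the `time` and `value` column names are assumptions, the rows are the eleven shown at the top of this issue, and `flagged` is a hand-copied subset of the reported timestamps (2019-01-12 is left out because its value does not appear in the thread).

```python
import pandas as pd

# The eleven rows quoted in the issue; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "time": pd.date_range("2019-01-01", periods=11, freq="D"),
        "value": [35, 32, 30, 31, 44, 29, 45, 43, 500, 27, 38],
    }
)

# Timestamps copied by hand from the reported output (not produced by Kats here).
flagged = list(pd.date_range("2019-01-06", "2019-01-11", freq="D"))

# Index the values by time, then look up each flagged timestamp.
by_time = df.set_index("time")["value"]
flagged_values = by_time.loc[flagged]
print(flagged_values)

# The largest flagged value points back at a single date.
print(flagged_values.idxmax())
```

With the index in hand, the values come straight from the original frame, so nothing is lost by having the detector return timestamps rather than values.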

@sthirumoorthi
Author

Hi @sanelemahlalela, getting the value or the index is not the problem. The model returns a range of values/indexes (multiple timestamps) instead of only the outliers. In the example above, the model returns the timestamps below as outliers, whereas I was expecting only one timestamp ('2019-01-09 00:00:00', value '500').

ts_outDetection.outliers[0] ->
[Timestamp('2019-01-06 00:00:00'),
Timestamp('2019-01-07 00:00:00'),
Timestamp('2019-01-08 00:00:00'),
Timestamp('2019-01-09 00:00:00'),
Timestamp('2019-01-10 00:00:00'),
Timestamp('2019-01-11 00:00:00'),
Timestamp('2019-01-12 00:00:00')]

As far as I understand, there are no hyperparameters to adjust (except iqr_mult) to increase the accuracy of the model. So I want to understand how the outlier detection logic works in this model and what needs to be done to detect the exact outliers.
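
One possible explanation, offered as a hypothesis rather than a confirmed answer from the maintainers: if the detector subtracts a centered moving-average trend before applying an IQR rule to the residuals, a single large spike inflates the trend at every point inside the averaging window, so the neighbours of the spike get strongly negative residuals and are flagged as well. The sketch below reproduces that effect with pandas only. The 7-day window, the multiplier of 3.0, the helper `iqr_outliers`, and the synthetic series are all assumptions made for illustration; none of this is the Kats implementation or the dataset from the issue.

```python
import numpy as np
import pandas as pd


def iqr_outliers(series: pd.Series, mult: float = 3.0) -> pd.Series:
    """Return the points lying more than mult * IQR outside the quartiles."""
    q1, q3 = np.nanpercentile(series.to_numpy(), [25, 75])  # NaNs are ignored
    iqr = q3 - q1
    mask = (series < q1 - mult * iqr) | (series > q3 + mult * iqr)
    return series[mask]


# Synthetic daily series (NOT the issue's dataset): values cycle through 38..42,
# with a single spike of 500 placed on 2019-01-09.
dates = pd.date_range("2019-01-01", periods=60, freq="D")
ts = pd.Series([38.0 + (i % 5) for i in range(60)], index=dates)
ts.iloc[8] = 500.0

# Assumed detrending step: centered 7-day moving average (NaN at the edges).
trend = ts.rolling(window=7, center=True).mean()
resid = ts - trend

# IQR rule on the residuals: the spike and its six neighbours are flagged.
smeared = iqr_outliers(resid)
print(list(smeared.index.strftime("%Y-%m-%d")))

# IQR rule on the raw values: only the spike itself is flagged.
exact = iqr_outliers(ts)
print(list(exact.index.strftime("%Y-%m-%d")))
```

On this synthetic series the residual-based check flags 2019-01-06 through 2019-01-12, the same seven-day span reported above, while the raw-value check flags only 2019-01-09. If the hypothesis holds for the real data, applying a simple IQR (or similar) rule to the raw values, or to the flagged window only, would be one way to isolate the exact outlier.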

@sanelemahlalela

I don't get the problem of multiple timestamps when I run the detector on your data.

[Image: Screenshot 2021-10-16 at 15 44 00]

@sthirumoorthi
Author

@sanelemahlalela, thanks for checking the details. In your case the detector returns the results as expected. I'm not sure why it is not working for my test data. Could it be an issue with my test data, or with the way I coded the logic? I double-checked the Python code and couldn't find any issue.

I understand that you picked a portion of my test data for your test. Could you please try the actual data (CSV file) from my repository?
Both files (Python code and CSV file) are available at the link below:
https://github.com/sthirumoorthi/TimeSeries-Models/tree/main/FB%20Kats%20with%20Example

@rohanfb rohanfb closed this as completed Feb 14, 2022