# Description

This notebook explores approaches to detect outliers in crypto data.

# Imports

In [1]:
import logging

import pandas as pd

import helpers.dbg as hdbg
import helpers.env as henv
import helpers.printing as hprint
import im.ccxt.data.load.loader as imcdalolo
import research_amp.cc.detect_outliers as rccdeout

In [2]:
hdbg.init_logger(verbosity=logging.INFO)

_LOG = logging.getLogger(__name__)

_LOG.info("%s", henv.get_system_signature()[0])

hprint.config_notebook()

INFO: > cmd='/venv/lib/python3.8/site-packages/ipykernel_launcher.py -f /home/.local/share/jupyter/runtime/kernel-9bb677ed-7544-42e6-ada2-ee79c7696d0e.json'
>>ENV<<: is_inside_container=True: code_version=None, container_version=cmamp-1.0.0, is_inside_docker=True, is_inside_ci=False, CI_defined=True, CI=''
>>ENV<<: AM_AWS_PROFILE=True AM_ECR_BASE_PATH=True AM_S3_BUCKET=True AM_TELEGRAM_TOKEN=True AWS_ACCESS_KEY_ID=False AWS_DEFAULT_REGION=False AWS_SECRET_ACCESS_KEY=False GH_ACTION_ACCESS_TOKEN=True
-----------------------------------------------------------------------------
This code is not in sync with the container:
code_version='None' != container_version='cmamp-1.0.0'
-----------------------------------------------------------------------------
You need to:
- merge origin/master into your branch with `invoke git_merge_master`
- pull the latest container with `invoke docker_pull`
# Git
    branch_name='CmTask456_Refactor_CCXT_loader_in_im_v2'
    hash='841fba

# Load test data

In [3]:
root_dir = "s3://alphamatic-data/data"
ccxt_loader = imcdalolo.CcxtLoaderFromFile(root_dir=root_dir, aws_profile="am")
data = ccxt_loader.read_data("kucoin", "ETH/USDT", "ohlcv")
data.head()

Reading CCXT data for exchange id='kucoin', currencies='ETH/USDT' from file='s3://alphamatic-data/data/ccxt/20210924/kucoin/ETH_USDT.csv.gz'...
Processing CCXT data for exchange id='kucoin', currencies='ETH/USDT'...
Index length increased by 117548 = 1619960 - 1502412


| | open | high | low | close | volume | epoch | currency_pair | exchange_id |
|---|---|---|---|---|---|---|---|---|
| 2018-08-16 20:01:00-04:00 | 286.712987 | 286.712987 | 286.712987 | 286.712987 | 0.0175 | 1534464000000.0 | ETH/USDT | kucoin |
| 2018-08-16 20:02:00-04:00 | 286.405988 | 286.405988 | 285.400193 | 285.400197 | 0.162255 | 1534464000000.0 | ETH/USDT | kucoin |
| 2018-08-16 20:03:00-04:00 | 285.400193 | 285.400193 | 285.400193 | 285.400193 | 0.02026 | 1534464000000.0 | ETH/USDT | kucoin |
| 2018-08-16 20:04:00-04:00 | 285.400193 | 285.884638 | 285.400193 | 285.884638 | 0.074655 | 1534464000000.0 | ETH/USDT | kucoin |
| 2018-08-16 20:05:00-04:00 | 285.400196 | 285.884637 | 285.400196 | 285.884637 | 0.006141 | 1534464000000.0 | ETH/USDT | kucoin |


Get multiple chunks of the latest data for performance checks.

In [4]:
# 10-day chunk: 10 * 1440 one-minute rows.
chunk_10days = data.tail(14400).copy()
# 20-day chunk: 20 * 1440 one-minute rows.
chunk_20days = data.tail(28800).copy()
# 40-day chunk: 40 * 1440 one-minute rows.
chunk_40days = data.tail(57600).copy()

# Mask approach

The timings below show that execution time grows faster than linearly (roughly quadratically) with the length of the input chunk.

Taking the number of days in a chunk as `x` and the rounded execution time in seconds as `y`, a rough quadratic fit to the test samples gives:<br>
`y = (11/1500)x^2 + (3/4)x + (4/15)`<br>

At this rate, processing the full series of 1619960 rows (about 1125 days of 1-minute data) should take ~3-4 hours, so we should think about ways to apply this function more efficiently.
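As a rough cross-check of this estimate, the sketch below extrapolates to the full series in two ways: with the quadratic quoted above, and with the parabola that passes exactly through the three wall times reported by the `%%time` cells in this section. The helper names `parabola_through` and `formula_from_text` are hypothetical and exist only for this illustration; extrapolating from three points this far out is only indicative.

```python
def parabola_through(points):
    """Return the quadratic that passes through three (x, y) points (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def evaluate(x):
        return (
            y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))
        )

    return evaluate


def formula_from_text(x):
    """Quadratic quoted in the text: y = (11/1500)x^2 + (3/4)x + 4/15."""
    return 11 / 1500 * x**2 + 3 / 4 * x + 4 / 15


# (days in chunk, wall time in seconds) as reported by the `%%time` cells.
measured = [(10, 8.72), (20, 18.6), (40, 44.5)]
# Full series length expressed in days of 1-minute rows.
full_days = 1619960 / 1440

hours_from_formula = formula_from_text(full_days) / 3600
hours_from_measured = parabola_through(measured)(full_days) / 3600
print(f"formula: {hours_from_formula:.1f} h, measured fit: {hours_from_measured:.1f} h")
```

Both extrapolations land in the range of a few hours, which is the order of magnitude that matters for the conclusion here.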

In [5]:
%%time
outlier_mask_10days = rccdeout.detect_outliers(
    srs=chunk_10days["close"], n_samples=1440, z_score_threshold=4
)

CPU times: user 8.72 s, sys: 0 ns, total: 8.72 s
Wall time: 8.72 s


In [6]:
%%time
outlier_mask_20days = rccdeout.detect_outliers(
    srs=chunk_20days["close"], n_samples=1440, z_score_threshold=4
)

CPU times: user 18.6 s, sys: 0 ns, total: 18.6 s
Wall time: 18.6 s


In [7]:
%%time
outlier_mask_40days = rccdeout.detect_outliers(
    srs=chunk_40days["close"], n_samples=1440, z_score_threshold=4
)

CPU times: user 44.5 s, sys: 37.2 ms, total: 44.5 s
Wall time: 44.5 s


Another problem with this approach is that its results are not robust when a sharp rise or drop is followed by a continued move in the same direction. In this case all the values after the sharp change are considered outliers and dropped.

Take a look at the 10-day chunk result. 76% of its values are considered outliers with a Z-score threshold of 4, even though 3 is the standard choice. After 2021-09-07 04:25:00-04:00 the price falls from 3848.65 to 3841.97, and all the following observations that are below 3841.95 are considered outliers as well.<br>

This is expected: we do not normalize the data within the window before computing z-scores, while the data clearly has trends, so values at the edge of the z-score window can easily fall outside the standard z-score threshold.

Since crypto data is very volatile, we can end up losing a lot of data this way, so we should choose the window sample size and Z-score threshold carefully.
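One mechanism that would produce this "everything after the move is an outlier" pattern is a detector with memory, where flagged values never enter the reference window. The toy below illustrates that failure mode on synthetic prices. It is an assumption about the failure mode, not a description of how `rccdeout.detect_outliers` is implemented; `flag_with_memory` and the synthetic series are hypothetical.

```python
import statistics


def flag_with_memory(values, window_size, threshold):
    """Toy detector: score each value against the last `window_size` accepted values."""
    # Assumption: the first `window_size` values are clean and seed the window.
    accepted = list(values[:window_size])
    flags = [False] * window_size
    for value in values[window_size:]:
        window = accepted[-window_size:]
        z_score = (value - statistics.mean(window)) / statistics.stdev(window)
        is_outlier = not abs(z_score) <= threshold
        flags.append(is_outlier)
        # Flagged values never enter the window, so the reference level never adapts.
        if not is_outlier:
            accepted.append(value)
    return flags


# Synthetic prices: a small oscillation around 100, then a lasting drop to 90.
noise = [(i % 5 - 2) * 0.1 for i in range(100)]
prices = [100 + n for n in noise[:50]] + [90 + n for n in noise[50:]]

flags = flag_with_memory(prices, window_size=20, threshold=4)
print(sum(flags[:50]), sum(flags[50:]))
```

In this toy nothing is flagged before the drop and every value after it is flagged, because the window stays frozen at the old price level.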

In [8]:
outlier_mask_10days.sum() / outlier_mask_10days.shape[0]

0.7622222222222222

In [9]:
outlier_mask_10days[:3426]

array([False, False, False, ..., False, False,  True])

In [10]:
outlier_mask_10days[3426:]

array([ True,  True,  True, ...,  True,  True,  True])

In [11]:
set(outlier_mask_10days[3426:])

{True}

In [12]:
chunk_10days["close"][~outlier_mask_10days].tail()

2021-09-07 04:21:00-04:00    3856.48
2021-09-07 04:22:00-04:00    3848.52
2021-09-07 04:23:00-04:00    3848.09
2021-09-07 04:24:00-04:00    3853.52
2021-09-07 04:25:00-04:00    3848.65
Name: close, dtype: float64

In [13]:
chunk_10days["close"][outlier_mask_10days].head()

2021-09-06 08:03:00-04:00    3866.12
2021-09-07 04:26:00-04:00    3841.97
2021-09-07 04:27:00-04:00    3811.05
2021-09-07 04:28:00-04:00    3828.69
2021-09-07 04:29:00-04:00    3808.34
Name: close, dtype: float64

The other chunks have many false outliers as well: about 53% and 31% of the values are flagged in the 20-day and 40-day chunks, respectively.

In [14]:
print(outlier_mask_20days.sum() / outlier_mask_20days.shape[0])
print(outlier_mask_40days.sum() / outlier_mask_40days.shape[0])

0.5251388888888889
0.3101388888888889


# Dropping outliers on-the-fly approach

In [171]:
def remove_outlier_at_index(
    srs: pd.Series,
    z_score_boundary: int,
    z_score_window_size: int,
    index_to_check: int,
) -> pd.Series:
    """
    Check if a series value at index is an outlier and remove it if so.

    Index should be a row of positive integers like 0, 1, 2, etc.

    Z-score window indices are adjusted with respect to the window size, the size
    of the input, and the index to check.

    Z-score window size is an integer number of index steps that will be included
    in Z-score computation and outlier detection.

    :param srs: input series
    :param z_score_boundary: boundary value to check for outlier's Z-score
    :param z_score_window_size: size of the window to compute Z-score for
    :param index_to_check: index of a value to check
    :return: input series with removed value at given index if it was considered an outlier
    """
    # Get numerical order of a given index.
    index_order = srs.index.get_loc(index_to_check)
    # Set window indices.
    window_first_index = max(0, index_order - z_score_window_size)
    window_last_index = max(index_order, window_first_index + z_score_window_size)
    # Verify that the distance between window indices equals Z-score window size
    # and that the index to check lies between these indices.
    hdbg.dassert_eq(z_score_window_size, window_last_index - window_first_index)
    hdbg.dassert_lte(window_first_index, index_order)
    hdbg.dassert_lte(index_order, window_last_index)
    # Get a window to compute Z-score for.
    window_srs = srs.iloc[window_first_index:window_last_index].copy()
    # Compute Z-score of the value at the given position; use `.iloc` since
    # `index_order` is a position, not an index label.
    z_score = (srs.iloc[index_order] - window_srs.mean()) / window_srs.std()
    # Drop the value if its Z-score is NaN or lies beyond the specified boundary.
    if not abs(z_score) <= z_score_boundary:
        srs = srs.drop([index_to_check]).copy()
    return srs


def remove_rolling_outliers(
    df: pd.DataFrame,
    col: str,
    z_score_boundary: int,
    z_score_window: int,
) -> pd.DataFrame:
    """
    Remove outliers using a rolling window.

    Outliers are removed consecutively after every window check.

    Z-score window indices are adjusted with respect to the window size, the size
    of the input, and the index to check.

    Z-score window size is an integer number of index steps that will be included
    in Z-score computation and outlier detection.

    :param df: input dataframe
    :param col: column to check for outliers
    :param z_score_boundary: Z-score boundary to check the value
    :param z_score_window: size of the window to compute Z-score for
    :return: dataframe with removed outliers
    """
    # Get a series to detect outliers in.
    price_srs = df[col].copy()
    # Iterate over series indices.
    for index_ in price_srs.index:
        # For every index check if its value is an outlier and
        # remove it from the series if so.
        price_srs = remove_outlier_at_index(
            price_srs, z_score_boundary, z_score_window, index_
        )
    # Get dataframe rows that correspond to the non-outliers indices.
    clean_df = df[df.index.isin(price_srs.index)].copy()
    return clean_df

The on-the-fly approach is slower than the mask approach on small chunks, and its execution time also grows somewhat faster than linearly with the series length (19 s, 41.2 s, and 1 min 34 s for the 10-, 20-, and 40-day chunks below).

So it is less efficient than the mask approach and should not be used.

In [174]:
%%time
old_clean_chunk_10days = remove_rolling_outliers(chunk_10days, "close", 3, 1440)

CPU times: user 19 s, sys: 89 µs, total: 19 s
Wall time: 19 s


In [175]:
%%time
old_clean_chunk_20days = remove_rolling_outliers(chunk_20days, "close", 3, 1440)

CPU times: user 41.2 s, sys: 79 µs, total: 41.2 s
Wall time: 41.2 s


In [176]:
%%time
old_clean_chunk_40days = remove_rolling_outliers(chunk_40days, "close", 3, 1440)

CPU times: user 1min 34s, sys: 11.9 ms, total: 1min 34s
Wall time: 1min 34s


# Overlapping windows approach

In [37]:
def detect_outliers_new(
    srs: pd.Series,
    n_samples: int = 1440,
    window_step: int = 10,
    z_score_threshold: float = 3.0,
) -> pd.Series:
    """
    Detect outliers using overlapping windows and averaged z-scores of each
    observation.

    With a step of `window_step`, almost every observation belongs to about
    `n_samples / window_step` windows, so it gets that many Z-scores. The mean of these
    scores gives an averaged Z-score, which should be a more robust metric for checking
    whether a value is an outlier than a rolling Z-score computed just once.

    This function
    - creates list of overlapping z-score windows
    - computes z-score of each element in every window
    - for each observation takes average of all the z-scores from the windows it belongs to
    - compares averaged z-score to the threshold to declare the current element an outlier

    :param srs: input series
    :param n_samples: number of samples in z-score windows
    :param window_step: number of samples between the starts of consecutive windows
    :param z_score_threshold: threshold to mark a value as an outlier based on its averaged z-score
    :return: boolean mask that is True where an observation is considered an outlier
    """
    # Create a list of overlapping windows.
    windows = [
        srs.iloc[idx : idx + n_samples]
        for idx in range(0, srs.shape[0] - n_samples + window_step, window_step)
    ]
    # Compute z-score for each observation in every window.
    z_scores_list = [
        abs((window - window.mean()) / window.std()) for window in windows
    ]
    # Concatenate z-scores series in one.
    z_scores_srs = pd.concat(z_scores_list)
    # Group by index and take the averaged z-score for every index value.
    z_scores_stats = z_scores_srs.groupby(z_scores_srs.index).mean()
    # Get a mask for outliers.
    # Negate `<=` instead of using `>`: any comparison with NaN is False, so
    # `>` would miss NaN values, while `~(... <= ...)` flags them as well.
    outliers_mask = ~(z_scores_stats <= z_score_threshold)
    return outliers_mask

Since both approaches above are very slow and can't realistically be applied to all the data directly, I'd like to propose another approach to this problem.

A description of the approach can be found in the function docstring. In short, this approach has no memory: no observations are removed along the way, and an averaged z-score is computed for each observation over the multiple windows it belongs to. IMO this should make outlier detection more robust and give consistent results for most observations (only corner cases may differ, since full-sized windows always contain the same values).

This approach might be less robust to consecutive outliers than the previous ones, but it is much faster: it processes the whole series in just 2 minutes with 1-day windows that start every 10 minutes.<br>
Therefore, if the robustness of this algorithm is sufficient for us, I suggest we use it for outlier detection.
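To make "the multiple windows it belongs to" concrete, the sketch below counts how many windows cover a given position, reusing the same `range` expression for window starts as `detect_outliers_new`. The helper `count_covering_windows` is hypothetical and only counts windows; it does not compute any z-scores.

```python
def count_covering_windows(position, series_length, n_samples=1440, window_step=10):
    """Count the windows `[start, start + n_samples)` that contain `position`."""
    # Same window starts as in `detect_outliers_new`.
    starts = range(0, series_length - n_samples + window_step, window_step)
    return sum(1 for start in starts if start <= position < start + n_samples)


series_length = 14400  # number of rows in the 10-day chunk
interior = count_covering_windows(5000, series_length)
first = count_covering_windows(0, series_length)
last = count_covering_windows(series_length - 1, series_length)
print(interior, first, last)
```

With the default parameters an interior observation is covered by 144 windows, while the very first and very last observations are covered by a single window each, which is why only the observations near the edges of a chunk can get noticeably different averaged z-scores.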

In [38]:
all_outliers_mask = detect_outliers_new(data["close"])

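The algorithm flags NaN values and a small number of outliers: of the 118231 flagged values, only 683 are non-NaN. The remaining 117548 match the number of rows added when the index was extended during loading.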
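The NaN values end up in the mask because `detect_outliers_new` builds it as `~(z_scores_stats <= z_score_threshold)`. A minimal sketch of the comparison rule behind this, using plain Python floats as stand-ins for the averaged z-scores (the sample values are made up for illustration):

```python
import math

threshold = 3.0
# Hypothetical averaged z-scores: a normal value, an outlier, and a missing value.
z_scores = [0.5, 4.2, math.nan]

# A direct `>` comparison is False for NaN, so the missing value is not flagged.
flagged_direct = [z > threshold for z in z_scores]
# Negating `<=` is True for NaN, so the missing value is flagged as well.
flagged_negated = [not (z <= threshold) for z in z_scores]

print(flagged_direct)
print(flagged_negated)
```

Only the negated form marks the missing value, which is why the mask has to be filtered with `dropna()` on the selected values to count the actual outliers.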

In [45]:
all_outliers = data["close"][all_outliers_mask]
len(all_outliers)

118231

In [47]:
len(all_outliers.dropna())

683

Computations for small chunks finish almost immediately, and the detected outliers are stable across the chunks (the 40-day chunk only adds outliers from dates that the shorter chunks do not cover).

In [50]:
outlier_mask_new_40days = detect_outliers_new(chunk_40days["close"]).dropna()

In [53]:
chunk_40days["close"][outlier_mask_new_40days]

2021-08-10 09:31:00-04:00    3224.65
2021-08-10 09:34:00-04:00    3228.84
2021-08-10 09:35:00-04:00    3229.72
2021-08-10 09:36:00-04:00    3224.46
2021-09-06 08:03:00-04:00    3866.12
2021-09-07 11:08:00-04:00    3083.99
2021-09-07 11:09:00-04:00    3016.85
2021-09-07 11:10:00-04:00    3120.77
Name: close, dtype: float64

In [51]:
outlier_mask_new_20days = detect_outliers_new(chunk_20days["close"]).dropna()

In [54]:
chunk_20days["close"][outlier_mask_new_20days]

2021-09-06 08:03:00-04:00    3866.12
2021-09-07 11:08:00-04:00    3083.99
2021-09-07 11:09:00-04:00    3016.85
2021-09-07 11:10:00-04:00    3120.77
Name: close, dtype: float64

In [52]:
outlier_mask_new_10days = detect_outliers_new(chunk_10days["close"]).dropna()

In [55]:
chunk_10days["close"][outlier_mask_new_10days]

2021-09-06 08:03:00-04:00    3866.12
2021-09-07 11:08:00-04:00    3083.99
2021-09-07 11:09:00-04:00    3016.85
2021-09-07 11:10:00-04:00    3120.77
Name: close, dtype: float64