<div style="float:right; width:100px; text-align: center; margin: 10px;">
<img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f6db6d22-3b62-42e0-8ac0-e07c52649e63_1202x1291.png" alt="hummingbot"/>
<img src="https://crypto-lake.com/assets/img/lake.png" alt="Lake"/>
</div>

# Fake volume detection

Detect fake trades, which occur especially on altcoin pairs on smaller exchanges. Removing them cleans up the data for further analysis or feature computation.

Exchanges usually claim that they are not the ones generating the fake trades: 'some market makers do it', and the exchange supposedly cannot do anything about it. More likely they simply do not want to, as the inflated volume makes them look better next to their competition.

We use [crypto-lake.com](https://crypto-lake.com/#data) sample/free market data.

Quick links:
- [edit this notebook online](https://mybinder.org/v2/gh/crypto-lake/analysis-sharing/main?filepath=fake_volume_detection.ipynb) using Binder
- go to the [GitHub repo](https://github.com/crypto-lake/analysis-sharing/) and read about our analysis contest
- [follow our activity on Twitter](https://twitter.com/intent/user?screen_name=crypto_lake_com)

In [198]:
import datetime

import pandas as pd
import cufflinks as cf

import lakeapi

# Access crypto-lake free data
lakeapi.use_sample_data(anonymous_access=True)
# Set up default cufflinks plot configuration 
cf.set_config_file(margin = (10,10,10,50), dimensions = (None, 400))

In [139]:
# Parameters
symbol = 'AVAX-USDT'
exchange = 'GATEIO'
tick_size = 0.001

start = datetime.datetime(2022, 11, 1)
end = datetime.datetime(2022, 11, 10)

## Data

In [140]:
def load_data(table: str):
    print('Loading', table)
    return lakeapi.load_data(
        table = table,
        start = start,
        end = end,
        symbols = [symbol],
        exchanges = [exchange],
        drop_partition_cols = True,
    ).sort_values('received_time')

# Load l1 data = top of the order book
l1 = load_data('level_1')
l1 = l1.drop(columns = ['bid_0_size', 'ask_0_size'])
# Load trades
trades = load_data('trades')
trades = trades.drop(columns = ['trade_id', 'origin_time'])

Loading level_1
Loading trades

In [141]:
# Merge trades and l1 data
l1['future_bid'] = l1.bid_0_price.shift(-1)
l1['future_ask'] = l1.ask_0_price.shift(-1)


df = pd.merge_asof(
	left = trades.rename(columns = {'received_time': 'trade_received_time'}),
	right = l1.rename(columns = {'received_time': 'depth_received_time'}),
	left_on = 'trade_received_time',
	right_on = 'depth_received_time',
	tolerance = pd.Timedelta(minutes = 60),
)
df = df.dropna().reset_index(drop = True)
df['spread_ticks'] = (df.ask_0_price - df.bid_0_price) / tick_size
df['mid'] = (df.ask_0_price + df.bid_0_price) / 2
df['nominal'] = df.price * df.quantity
df.head(3)

Unnamed: 0,side,quantity,price,trade_received_time,depth_received_time,bid_0_price,ask_0_price,future_bid,future_ask,spread_ticks,mid,nominal
0,buy,0.1784,18.795,2022-11-01 15:05:27.616974336,2022-11-01 15:05:27.077800448,18.791,18.795,18.795,18.801,4.0,18.793,3.353028
1,sell,0.3149,18.795,2022-11-01 15:05:31.263184384,2022-11-01 15:05:30.577771776,18.795,18.796,18.789,18.796,1.0,18.7955,5.918546
2,sell,8.0788,18.793,2022-11-01 15:05:33.458042624,2022-11-01 15:05:33.361402624,18.791,18.796,18.791,18.796,5.0,18.7935,151.824888
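
To make the as-of join above concrete, here is a minimal sketch on made-up rows (the timestamps and prices are illustrative, not taken from the dataset, and `quotes` and `toy_trades` are hypothetical names). With the default `direction='backward'`, `pd.merge_asof` attaches to each trade the most recent quote received at or before it, while the shifted `future_*` columns carry the quote that arrived right after that one.

```python
import pandas as pd

# Toy top-of-book snapshots (made-up values for illustration)
quotes = pd.DataFrame({
    'depth_received_time': pd.to_datetime([
        '2022-11-01 15:00:00', '2022-11-01 15:00:05', '2022-11-01 15:00:10',
    ]),
    'bid_0_price': [18.790, 18.791, 18.795],
    'ask_0_price': [18.795, 18.796, 18.801],
})
# Quote that follows each snapshot; the last snapshot has no successor, so it gets NaN
quotes['future_bid'] = quotes['bid_0_price'].shift(-1)
quotes['future_ask'] = quotes['ask_0_price'].shift(-1)

# Toy trades (made-up values for illustration)
toy_trades = pd.DataFrame({
    'trade_received_time': pd.to_datetime([
        '2022-11-01 15:00:03', '2022-11-01 15:00:07', '2022-11-01 15:00:12',
    ]),
    'price': [18.795, 18.793, 18.801],
})

# Both frames must be sorted by their time keys for merge_asof
merged = pd.merge_asof(
    toy_trades,
    quotes,
    left_on='trade_received_time',
    right_on='depth_received_time',
    tolerance=pd.Timedelta(minutes=60),
)
print(merged)
```

The last toy trade matches the final snapshot, which has no following quote, so its `future_*` values are NaN and a `dropna()` like the one in the notebook would discard it.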


In [142]:
# Detection logic
epsilon = tick_size
df['fake'] = (
	# A trade is flagged as fake when its price is strictly inside the spread (by more than epsilon, here one tick)
	(df['price'] > df['bid_0_price'] + epsilon) &
	(df['price'] < df['ask_0_price'] - epsilon) &
	# To limit false positives, the price must also be inside the spread of the next depth update,
	# in case that update had not reached us yet when the trade arrived
	(df['price'] > df['future_bid'] + epsilon) &
	(df['price'] < df['future_ask'] - epsilon)
)

fake_volume = df.loc[df['fake'] == 1, 'quantity'].sum()
all_volume = df['quantity'].sum()

# Note that this method has some false positives, but still leads to better & cleaner data for most use cases
print('Fake trade percentage', df['fake'].mean() * 100)
print('Fake volume percentage', fake_volume / all_volume * 100)

Fake trade percentage 9.27493817175396
Fake volume percentage 20.233526632750124
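
The rule above compares floating-point prices right at the spread boundary. As a hedged variation (not the notebook's own code), the sketch below converts prices to integer ticks before comparing, so a trade sitting exactly one tick from the quote is classified deterministically. `flag_inside_spread` and `epsilon_ticks` are hypothetical names; the three sample rows are rows 0, 2 and 7 of the merged frame.

```python
import pandas as pd

def flag_inside_spread(frame, tick_size, epsilon_ticks=1):
    """Flag trades priced strictly inside both the current and the next spread.

    Hypothetical helper: prices are converted to integer ticks first, so the
    comparisons at the spread boundary are exact.
    """
    def to_ticks(series):
        return (series / tick_size).round().astype('int64')

    price = to_ticks(frame['price'])
    return (
        (price > to_ticks(frame['bid_0_price']) + epsilon_ticks) &
        (price < to_ticks(frame['ask_0_price']) - epsilon_ticks) &
        (price > to_ticks(frame['future_bid']) + epsilon_ticks) &
        (price < to_ticks(frame['future_ask']) - epsilon_ticks)
    )

# Rows 0, 2 and 7 of the merged frame
sample = pd.DataFrame({
    'price':       [18.795, 18.793, 18.794],
    'bid_0_price': [18.791, 18.791, 18.790],
    'ask_0_price': [18.795, 18.796, 18.795],
    'future_bid':  [18.795, 18.791, 18.792],
    'future_ask':  [18.801, 18.796, 18.796],
})

flags_one_tick = flag_inside_spread(sample, tick_size=0.001, epsilon_ticks=1)
flags_no_margin = flag_inside_spread(sample, tick_size=0.001, epsilon_ticks=0)
print(flags_one_tick.tolist())
print(flags_no_margin.tolist())
```

With a one-tick margin the third trade, which sits one tick below the ask, is treated as real; with no margin it would be flagged. A wider margin therefore misses some fake trades in exchange for fewer false positives.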


In [151]:
df.head(10)

Unnamed: 0,side,quantity,price,trade_received_time,depth_received_time,bid_0_price,ask_0_price,future_bid,future_ask,spread_ticks,mid,nominal,fake
0,buy,0.1784,18.795,2022-11-01 15:05:27.616974336,2022-11-01 15:05:27.077800448,18.791,18.795,18.795,18.801,4.0,18.793,3.353028,False
1,sell,0.3149,18.795,2022-11-01 15:05:31.263184384,2022-11-01 15:05:30.577771776,18.795,18.796,18.789,18.796,1.0,18.7955,5.918546,False
2,sell,8.0788,18.793,2022-11-01 15:05:33.458042624,2022-11-01 15:05:33.361402624,18.791,18.796,18.791,18.796,5.0,18.7935,151.824888,True
3,buy,0.3419,18.796,2022-11-01 15:05:41.261735680,2022-11-01 15:05:39.952322560,18.792,18.796,18.795,18.796,4.0,18.794,6.426352,False
4,sell,0.3419,18.795,2022-11-01 15:05:50.715498752,2022-11-01 15:05:50.050700288,18.795,18.796,18.789,18.796,1.0,18.7955,6.426011,False
5,sell,0.1622,18.789,2022-11-01 15:06:00.239222784,2022-11-01 15:05:58.446351360,18.789,18.792,18.783,18.79,3.0,18.7905,3.047576,False
6,buy,0.342,18.79,2022-11-01 15:06:02.843207168,2022-11-01 15:06:02.270531072,18.785,18.79,18.79,18.795,5.0,18.7875,6.42618,False
7,buy,0.1269,18.794,2022-11-01 15:06:05.945119744,2022-11-01 15:06:05.053243904,18.79,18.795,18.792,18.796,5.0,18.7925,2.384959,False
8,buy,0.1622,18.796,2022-11-01 15:06:08.527315200,2022-11-01 15:06:07.371106048,18.794,18.796,18.796,18.802,2.0,18.795,3.048711,False
9,buy,0.1621,18.802,2022-11-01 15:06:30.563549184,2022-11-01 15:06:28.569323264,18.801,18.802,18.803,18.808,1.0,18.8015,3.047804,False


In [150]:
df[df.fake == 1][::5].head(10)

Unnamed: 0,side,quantity,price,trade_received_time,depth_received_time,bid_0_price,ask_0_price,future_bid,future_ask,spread_ticks,mid,nominal,fake
2,sell,8.0788,18.793,2022-11-01 15:05:33.458042624,2022-11-01 15:05:33.361402624,18.791,18.796,18.791,18.796,5.0,18.7935,151.824888,True
87,sell,7.9545,18.793,2022-11-01 15:12:25.572994048,2022-11-01 15:12:23.167094784,18.791,18.795,18.791,18.795,4.0,18.793,149.488919,True
240,sell,5.4106,18.838,2022-11-01 15:28:23.670013440,2022-11-01 15:28:23.350905856,18.835,18.84,18.835,18.841,5.0,18.8375,101.924883,True
381,sell,6.6425,18.783,2022-11-01 15:44:24.035673600,2022-11-01 15:44:23.952991744,18.78,18.786,18.78,18.786,6.0,18.783,124.766078,True
594,sell,1.8042,18.783,2022-11-01 16:04:04.307063552,2022-11-01 16:04:02.647850496,18.78,18.786,18.78,18.785,6.0,18.783,33.888289,True
691,sell,9.4353,18.89,2022-11-01 16:12:32.348203008,2022-11-01 16:12:27.175571968,18.882,18.898,18.883,18.898,16.0,18.89,178.232817,True
724,sell,18.9151,18.875,2022-11-01 16:17:20.780268288,2022-11-01 16:17:18.157560832,18.873,18.878,18.868,18.877,5.0,18.8755,357.022513,True
899,sell,7.5877,18.761,2022-11-01 16:29:32.215202816,2022-11-01 16:29:31.255592192,18.758,18.764,18.758,18.764,6.0,18.761,142.35284,True
1136,sell,9.9906,18.609,2022-11-01 16:37:46.089624064,2022-11-01 16:37:43.058072576,18.606,18.612,18.606,18.611,6.0,18.609,185.915075,True
1195,sell,74.668,18.678,2022-11-01 16:45:25.170956544,2022-11-01 16:45:22.065390336,18.672,18.685,18.675,18.682,13.0,18.6785,1394.648904,True


## Plots

Open the notebook in Binder to see the plots; GitHub sadly does not render Plotly plots.

In [199]:
# What real trade quantities look like (outliers above the 99th percentile are dropped for better visualisation)
df[(df.fake == 0) & (df.quantity < df.quantity.quantile(0.99))]['quantity'].iplot(kind = 'hist', bins = 100, title = 'Real quantities')

In [200]:
# What fake quantities look like. Notice the block of trades with quantities between 5 and 10; perhaps the quantities are randomly sampled from a normal distribution.
# We could use this for even better detection with fewer false positives.
df[(df.fake == 1) & (df.quantity < df.quantity.quantile(0.99))]['quantity'].iplot(kind = 'hist', bins = 100, title = 'Fake quantities')
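
One way the quantity pattern could be used, sketched under assumptions: keep the in-spread flag, but additionally require the quantity to fall in the observed 5 to 10 block. The band limits `low` and `high` and the column name `fake_strict` are hypothetical choices, not part of the notebook; the quantities are the ten flagged trades sampled in the output above.

```python
import pandas as pd

# Quantities of the ten flagged trades shown above (every fifth flagged trade)
flagged = pd.DataFrame({
    'quantity': [8.0788, 7.9545, 5.4106, 6.6425, 1.8042,
                 9.4353, 18.9151, 7.5877, 9.9906, 74.668],
    'fake': True,
})

# Hypothetical band taken from the visible block in the histogram
low, high = 5.0, 10.0

# Stricter flag: inside the spread AND quantity within the band (inclusive)
flagged['fake_strict'] = flagged['fake'] & flagged['quantity'].between(low, high)

kept = int(flagged['fake_strict'].sum())
print(kept, 'of', len(flagged), 'flagged trades stay flagged')
```

A stricter rule like this would drop in-spread trades with unusual sizes, so it lowers false positives at the cost of missing fake trades outside the band.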

In [201]:
df['fake_nominal'] = df['nominal'] * df['fake']
df['real_nominal'] = df['nominal']  - df['fake_nominal']
df.set_index('trade_received_time')[['real_nominal', 'fake_nominal']].cumsum()[::1000].iplot(yTitle = 'Volume in $')
df = df.drop(columns = ['fake_nominal', 'real_nominal'])

## More data

The cell below lists the data available in the crypto-lake anonymous sample repository. Uncomment it to explore:

In [152]:
# for table in ('level_1', 'trades', 'book', 'candles'):
# 	available_data = pd.DataFrame(lakeapi.list_data(table = table))
# 	print(table)
# 	display(available_data[['exchange', 'symbol', 'dt']].groupby(['exchange', 'symbol']).aggregate({'dt': ['first', 'last']}))

## Conclusion

On the observed trading pair (AVAX-USDT on GATEIO), fake trades probably account for roughly 9% of trades and 20% of volume. That is a lot and can hurt model precision, especially during low-volume periods, when most trades can be fake. The fake volume looks fairly stable over time and does not follow overall volume the way e.g. VWAP or POV algorithms do. The fake trades appear to have quantities sampled from a normal distribution with a mean around 8.
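
The stability claim can be checked by comparing fake and total nominal volume per time bucket. The sketch below is one possible way to do it, on made-up rows; `fake_share` is a hypothetical helper, and the column names mirror those of the merged frame. If fake volume is roughly constant while total volume varies, the fake share rises in quiet periods.

```python
import pandas as pd

# Made-up trades for illustration; columns mirror the merged frame
toy = pd.DataFrame({
    'trade_received_time': pd.to_datetime([
        '2022-11-01 15:05', '2022-11-01 15:20', '2022-11-01 15:40',
        '2022-11-01 16:10', '2022-11-01 16:30',
    ]),
    'nominal': [100.0, 300.0, 100.0, 50.0, 150.0],
    'fake': [False, True, False, True, False],
})

def fake_share(frame, freq='60min'):
    """Fake, total and relative fake nominal volume per time bucket."""
    indexed = frame.set_index('trade_received_time')
    fake_nominal = (indexed['nominal'] * indexed['fake']).resample(freq).sum()
    total_nominal = indexed['nominal'].resample(freq).sum()
    return pd.DataFrame({
        'fake_nominal': fake_nominal,
        'total_nominal': total_nominal,
        'fake_share': fake_nominal / total_nominal,
    })

hourly = fake_share(toy)
print(hourly)
```

On the real data the same helper could be applied to `df` directly; buckets without any trades would yield NaN shares.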