# Labeling: Tail Sets

## Abstract

Tail set labels are a classification labeling technique introduced in the following paper: [Huerta, R., Corbacho, F. and
Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns.
Algorithmic Finance, 2(1), pp.45-58.](https://content.iospress.com/download/algorithmic-finance/af016?id=algorithmic-finance%2Faf016)

A tail set is defined to be a group of assets whose volatility-adjusted price change is in the highest or lowest
quantile, for example the highest or lowest 5%.

A classification model is then fit using these labels to determine which stocks to buy and sell, for a long / short
portfolio.

## How it works

We label the y variable using the tail set labeling technique: assets in the top tail set form the positive class (1)
and assets in the bottom tail set form the negative class (-1) of the training data. The original paper investigates
the performance of 3 types of metrics on which the tail sets are built:

1. Real returns
2. Residual alpha after regression on the sector index
3. Volatility-adjusted returns

For our particular implementation, we have focused on the volatility-adjusted returns.

An input DataFrame of prices is converted to returns, to which a volatility adjustment can optionally be applied. The formula for the volatility-adjusted return is:

$$r(t - t', t) = \frac{R(t-t',t)}{vol(t)}$$

Here $R(t-t', t)$ is the return over the interval from $t-t'$ to $t$, and $vol(t)$ is the volatility estimate at time $t$. We provide two volatility estimators: the exponential moving average of the absolute returns, and the traditional standard deviation. The paper suggests a 180-day window.
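The two estimators can be sketched in plain pandas as follows. This is a minimal illustration, not the library's internal code: the function name `vol_adjusted_returns`, the use of `span=window` for the exponential average, and the toy price data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def vol_adjusted_returns(prices: pd.DataFrame, window: int = 180,
                         method: str = "mean_abs_dev") -> pd.DataFrame:
    """Divide one-period returns by a per-asset volatility estimate (hypothetical helper)."""
    rets = prices.pct_change(fill_method=None)
    if method == "mean_abs_dev":
        # Exponentially weighted mean of absolute returns (span is an assumed choice).
        vol = rets.abs().ewm(span=window, min_periods=window).mean()
    elif method == "stdev":
        # Rolling sample standard deviation of returns.
        vol = rets.rolling(window).std()
    else:
        raise ValueError("method must be 'mean_abs_dev' or 'stdev'")
    return rets / vol


# Toy data: 3 hypothetical assets, 30 days of random-walk prices.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(30, 3)), axis=0)),
    columns=["A", "B", "C"],
)
adj_std = vol_adjusted_returns(prices, window=5, method="stdev")
adj_mad = vol_adjusted_returns(prices, window=5, method="mean_abs_dev")
```

Both variants return NaN until a full window of returns is available, which is why the labeled output starts well after the first price date.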

The volatility-adjusted return of each stock is assigned to a quantile relative to the other returns in the same row, i.e. at the same timestamp. The top and bottom quantiles are then labeled as the positive and negative classes, respectively.
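The row-wise quantile assignment can be illustrated with `pandas.qcut`. The helper `label_row` and the ten asset names and values below are hypothetical; this is a sketch of the idea rather than the library's implementation.

```python
import pandas as pd


def label_row(row: pd.Series, n_bins: int) -> pd.Series:
    """Label one timestamp: 1 for the top quantile, -1 for the bottom, 0 otherwise."""
    # qcut splits the row into n_bins equal-count bins numbered 0 .. n_bins - 1.
    bins = pd.qcut(row, q=n_bins, labels=False)
    labels = pd.Series(0, index=row.index)
    labels[bins == n_bins - 1] = 1
    labels[bins == 0] = -1
    return labels


# Ten hypothetical assets with distinct volatility-adjusted returns on one day.
row = pd.Series(
    [-2.4, -0.6, -0.5, -0.2, 0.0, 0.3, 0.7, 0.9, 1.7, 2.4],
    index=list("ABCDEFGHIJ"),
)
labels = label_row(row, n_bins=5)
print(labels[labels != 0])
```

With 5 bins and 10 assets, each tail set holds two assets: `A` and `B` are labeled -1, `I` and `J` are labeled 1. Applying the helper to a whole DataFrame would look like `df.apply(label_row, axis=1, n_bins=5)`, after dropping rows that contain only NaN.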

## How to use these labels in practice?

The tail set labeling code returns the names of the assets that should receive a positive or negative label. Note
that the model you would develop is a many-to-one model: it has many x variables and only one y variable. The model
is a binary classifier.
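One way to turn a date-by-asset label matrix into training targets is to reshape it to long format and keep only the tail set members. The sketch below assumes that assets labeled 0 are excluded from training; the tickers `AAA`, `BBB`, `CCC` and the `momentum` feature are made up for illustration.

```python
import pandas as pd

# Hypothetical label matrix: rows are dates, columns are assets,
# values are 1 (positive tail), -1 (negative tail), or 0 (neither).
label_matrix = pd.DataFrame(
    {"AAA": [1, 0], "BBB": [0, -1], "CCC": [-1, 1]},
    index=pd.to_datetime(["2020-01-06", "2020-01-07"]),
)

# Long format: one row per (date, asset) pair.
long = (
    label_matrix.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="asset", value_name="label")
)

# Hypothetical feature table keyed by the same (date, asset) pairs.
features = long[["date", "asset"]].copy()
features["momentum"] = [0.8, 0.1, 0.0, -0.7, -0.9, 0.6]

# Keep only tail set members and attach their features.
train = long[long["label"] != 0].merge(features, on=["date", "asset"])
X = train[["momentum"]]  # many x variables in practice; one here for brevity
y = train["label"]       # single binary target in {1, -1}
```

`X` and `y` can then be passed to any binary classifier that accepts a feature matrix and a label vector.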

The model is trained on the training data and then used to score every security in the test data (on a given day).
Example: on December 1st 2019 the strategy needs to rebalance its positions. We score all 100 securities in our tradable
universe and rank the outputs from highest to lowest. We form a long / short portfolio by going long the top 10
stocks and short the bottom 10 (equally weighted). We then hold the position until the next rebalance date.
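The rebalance step can be sketched as below. The function `long_short_weights`, the random scores, and the choice to scale each leg to a gross weight of 1 are assumptions for illustration; any classifier output that ranks securities could take the place of `scores`.

```python
import numpy as np
import pandas as pd


def long_short_weights(scores: pd.Series, n_long: int, n_short: int) -> pd.Series:
    """Equal-weight long the highest scores and short the lowest (hypothetical helper)."""
    ranked = scores.sort_values(ascending=False)
    weights = pd.Series(0.0, index=scores.index)
    weights.loc[ranked.index[:n_long]] = 1.0 / n_long
    weights.loc[ranked.index[-n_short:]] = -1.0 / n_short
    return weights


# Hypothetical classifier scores for 100 securities on a rebalance date.
rng = np.random.default_rng(42)
scores = pd.Series(rng.random(100), index=[f"S{i:03d}" for i in range(100)])
weights = long_short_weights(scores, n_long=10, n_short=10)
```

The resulting weights sum to zero, so the portfolio is dollar-neutral under these assumptions, and they stay fixed until the next rebalance date.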

---
## Examples of use

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import yfinance as yf

from mlfinlab.labeling import TailSetLabels

In [2]:
# Load price data for 20 stocks
tickers = "AAPL MSFT COST PFE SYY F GE BABA AMD CCL ZM FB WFC JPM NVDA CVX TWTR ACI GPS KO"

data = yf.download(tickers, start="2019-01-20", end="2020-05-25", group_by="ticker")
data = data.loc[:, (slice(None), 'Adj Close')]
data.columns = data.columns.droplevel(1)
data.head()

[*********************100%***********************]  20 of 20 completed


Date,BABA,PFE,AAPL,MSFT,CVX,F,KO,FB,GE,JPM,COST,GPS,ACI,AMD,ZM,WFC,TWTR,SYY,NVDA,CCL
2019-01-22,152.149994,39.946537,150.266403,103.568062,105.162872,7.837517,45.439438,147.570007,8.272988,98.963676,209.413116,22.838694,5100.0,19.76,,46.484818,32.25,60.661575,148.035126,51.750248
2019-01-23,152.029999,39.842587,150.87413,104.577469,104.273567,7.689988,45.96315,144.300003,8.33986,98.71373,209.107468,23.241623,5100.0,19.799999,,46.727222,30.969999,60.884239,148.552521,51.512947
2019-01-24,155.860001,38.699097,149.678253,104.077667,106.258125,7.929724,45.410866,145.830002,8.387626,98.7714,207.352509,23.122576,5100.0,20.85,,46.596699,31.610001,60.535717,157.060287,52.224846
2019-01-25,159.210007,38.406132,154.638153,105.028282,105.986641,8.169458,45.106155,149.009995,8.750644,99.396294,206.129929,23.516348,5100.0,21.93,,46.736546,32.900002,60.041988,159.358887,52.689953
2019-01-28,158.919998,37.357147,153.207047,102.980049,105.003731,7.985046,44.91571,147.470001,8.530922,99.867378,207.806046,23.598763,5100.0,20.18,,46.447533,33.130001,60.284012,137.328262,53.53474


In [3]:
# Create tail set labels with mean absolute deviation as the volatility adjustment.
labels = TailSetLabels(data, n_bins=10, vol_adj='mean_abs_dev', window=180)
pos_set, neg_set, matrix_set = labels.get_tail_sets()

In [4]:
# Get the positive set: the assets in the top 10% of volatility-adjusted returns for each day.
pos_set.head()

Date
2020-01-06      [GPS, ZM]
2020-01-07     [ZM, TWTR]
2020-01-08    [MSFT, SYY]
2020-01-09     [KO, COST]
2020-01-10     [PFE, GPS]
dtype: object

In [5]:
# Get the negative set: the assets in the bottom 10% of volatility-adjusted returns for each day.
neg_set.head()

Date
2020-01-06    [WFC, CCL]
2020-01-07    [CVX, JPM]
2020-01-08     [CVX, GE]
2020-01-09    [PFE, GPS]
2020-01-10     [GE, JPM]
dtype: object

In [6]:
# Full label matrix: 1, -1, or 0 for every asset on each day.
matrix_set.head()

Date,BABA,PFE,AAPL,MSFT,CVX,F,KO,FB,GE,JPM,COST,GPS,ACI,AMD,ZM,WFC,TWTR,SYY,NVDA,CCL
2020-01-06,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,-1,0,0,0,-1
2020-01-07,0,0,0,0,-1,0,0,0,0,-1,0,0,0,0,1,0,1,0,0,0
2020-01-08,0,0,0,1,-1,0,0,0,-1,0,0,0,0,0,0,0,0,1,0,0
2020-01-09,0,-1,0,0,0,0,1,0,0,0,1,-1,0,0,0,0,0,0,0,0
2020-01-10,0,1,0,0,0,0,0,0,-1,-1,0,1,0,0,0,0,0,0,0,0


In [7]:
# Inspect the volatility-adjusted returns that the labels are based on.
labels.vol_adj_rets.dropna().head()

Date,BABA,PFE,AAPL,MSFT,CVX,F,KO,FB,GE,JPM,COST,GPS,ACI,AMD,ZM,WFC,TWTR,SYY,NVDA,CCL
2020-01-06,-0.121334,-0.163835,0.745082,0.311705,-0.42372,-0.510353,-0.056495,1.680486,0.905581,-0.09861,0.039689,2.396394,0.0,-0.227461,1.907438,-0.650362,0.259407,-0.217696,0.266647,-2.423721
2020-01-07,0.336044,-0.430076,-0.445724,-1.10447,-1.592236,0.917607,-1.188172,0.196681,-0.48104,-2.095482,-0.23071,-0.021452,0.0,-0.153854,0.959547,-0.901826,1.892687,-1.346047,0.769067,0.246498
2020-01-08,0.126608,1.022932,1.499143,1.883782,-1.41571,0.0,0.286338,0.91888,-0.595934,0.950251,1.652935,0.115619,0.0,-0.467444,0.391693,0.33165,1.048754,1.894668,0.121107,0.379514
2020-01-09,1.276731,-0.56311,1.950353,1.471037,-0.201126,0.103904,2.743687,1.2896,-0.165231,0.448849,2.271257,-1.860596,0.0,1.255253,0.04249,-0.18838,0.348897,0.196094,0.708613,0.723627
2020-01-10,0.686095,1.952353,0.211654,-0.55265,-1.136807,-0.105099,0.524137,-0.100989,-1.331385,-1.230105,-1.041752,1.082667,0.0,-0.879143,0.286838,-0.486054,-0.907804,0.570522,0.349014,-0.5782


### Error Handling

Errors will be raised if inputs are invalid.

In [8]:
# If number of bins is greater than the width of the price data i.e. exceeds the number of stocks.
try:
    TailSetLabels(data[:100], n_bins=50)
except Exception as exc:
    print(exc)

# If window is either not an int or too small.
try:
    TailSetLabels(data[:100], n_bins=10, vol_adj='stdev', window='str')
except Exception as exc:
    print(exc)
try:
    TailSetLabels(data[:100], n_bins=10, vol_adj='stdev', window=200)
except Exception as exc:
    print(exc)

n_bins exceeds the number of stocks!
If vol_adj is not None, window must be int.
Length of price data must be greater than the window.


---
## Conclusion

This notebook presents the tail set labeling method, which identifies the assets with the most extreme volatility-adjusted returns in a group of stocks on a given day. The user chooses the number of quantiles, and the top and bottom quantiles are labeled as the positive and negative tail sets, respectively. These labels can serve as training targets for a binary classifier. One possible strategy is to go long the predicted positive tail set and short the predicted negative one.

## References

1. Huerta, R., Corbacho, F. and Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns. Algorithmic Finance, 2(1), pp.45-58.