
#### A multiclass classification problem

by Aries P. Valeriano and Dave Emmanuel Q. Magno

## Executive Summary

The goal of this project is to build a prediction model that uses a stock chart pattern, specifically the double top, to predict whether a stock's price will decrease, increase, or stay roughly level within the next 10 days. If successful, traders can use this model to make data-driven decisions on their next trade.

To achieve this goal, we first create our own dataset. We determine the local minima and maxima of the price time series for each stock, drawn from various industries, and use them to detect the double top stock chart pattern. Five points (prices), each a maximum or a minimum, form the pattern; these become the descriptive features used to predict the target feature, which is the movement of the stock price within the next 10 days. We also include the indexes of the start and end of the pattern, as well as the industry of the stock in which the pattern occurs.

With the dataset in hand, we perform data exploration to visualize the double top pattern and to check for multicollinearity among the descriptive features, which we expect to be correlated because each is part of the same pattern. We also perform feature engineering to derive additional features that could help increase the predictive accuracy of the model.

After data exploration, we proceed to predictive modelling. Because the target feature has 3 levels, we fit multiclass models to our dataset, such as multiclass KNN, multinomial logistic regression, and multiclass SVM. This can be done in a single step with the pycaret machine learning library, which also splits the dataset (we set an 80/20 split) and applies preprocessing such as normalization (we use a min-max scaler) and one-hot encoding of the nominal feature (industry). The model that produces the highest F1 score is the gradient boosting classifier, a sequential ensemble approach, with a score of 0.5217. This F1 score, the harmonic mean of precision and recall, summarizes how well we can predict the movement of stock prices within 10 days after a double top stock chart pattern occurs.

## Jupyter Display Settings

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [2]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

## Prerequisites

In [3]:
from collections import Counter
from typing import Union
from itertools import combinations

from pycaret.classification import *
from sklearn.datasets import dump_svmlight_file
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedKFold, train_test_split
from sklearn.metrics import precision_score
from imblearn.under_sampling import NearMiss
import xgboost as xgb
from xgboost import XGBClassifier

import numpy as np
import pandas as pd
from dfply import *

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

from price_detection_tools import import_

## Tools

In [4]:
@dfpipe
def extract_date(df_: pd.DataFrame) -> pd.DataFrame:
    df = df_.copy(deep=True)
    for i, col_name in zip(range(2), ['dateF', 'dateE']):
        df[col_name] = df.date.apply(lambda x: x[i])
    return df

@dfpipe
def drop_(df_: pd.DataFrame) -> pd.DataFrame:
    # Drop the fw_ret_* columns (column-wise, not row-wise).
    to_drop = ['fw_ret_1', 'fw_ret_2', 'fw_ret_3']
    return df_.drop(columns=to_drop)

@dfpipe
def get_average(df_: pd.DataFrame) -> pd.DataFrame:
    df = df_.copy(deep=True)
    f1_idx = df.columns.get_loc('f1')
    f5_idx = df.columns.get_loc('f5')
    averages = df.iloc[:, f1_idx:f5_idx + 1].mean(axis=1)
    averages.rename('averages', inplace=True)
    return pd.concat([df_, averages], axis=1)

def lump_categories(data: Union[pd.DataFrame,
                                pd.Series],
                    percentage: float = 0.010) -> pd.DataFrame:
    """Replace categories rarer than `percentage` of the rows
    with 'Others'."""

    def set_threshold(df: pd.DataFrame,
                      percentage: float = 0.010) -> float:
        """Sets threshold to be a percentage of the shape
        of dataframe."""
        return df.shape[0] * percentage

    if isinstance(data, pd.Series):
        data = data.to_frame()
    # Compute the threshold from the data passed in, not from a global.
    threshold = set_threshold(data, percentage)
    return data.apply(lambda x: x.mask(
        x.map(x.value_counts()) < threshold,
        'Others'))

def encode_label(series: pd.Series):
    label_encoder = LabelEncoder()
    label_encoder.fit(series)
    return label_encoder.transform(series)

def get_height(fA_: pd.Series,
               fB_: pd.Series) -> pd.Series:
    """Absolute difference between two columns."""
    return np.abs(fA_ - fB_)

@dfpipe
def add_height_features(df_: pd.DataFrame) -> pd.DataFrame:
    df = df_.copy(deep=True)
    heights = ['h{}'.format(i + 1) for i in range(10)]
    feats = ['f{}'.format(i + 1) for i in range(5)]
    
    for height, comb in zip(heights,
                            combinations(feats, 2)):
        df[height] = get_height(df[comb[0]],
                                df[comb[1]])
    return df
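
To make the `h1` to `h10` names easier to read, the short sketch below lists which pair of pattern points each height compares. It relies only on the order in which `itertools.combinations` yields pairs, which is the order `add_height_features` iterates over; the `height_pairs` name is introduced here for illustration.

```python
from itertools import combinations

# Same naming scheme as add_height_features: five pattern points, ten pairs.
feats = ['f{}'.format(i + 1) for i in range(5)]
heights = ['h{}'.format(i + 1) for i in range(10)]

# combinations() yields pairs in lexicographic order, so h1 = |f1 - f2|,
# h2 = |f1 - f3|, and so on up to h10 = |f4 - f5|.
height_pairs = dict(zip(heights, combinations(feats, 2)))

for name, (a, b) in height_pairs.items():
    print('{} = |{} - {}|'.format(name, a, b))
```

Under this ordering, the heights kept later in `to_filter` (`h1`, `h2`, `h4`, `h6`, `h9`) correspond to the pairs (f1, f2), (f1, f3), (f1, f5), (f2, f4), and (f3, f5).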

## Data Description

The dataset contains a target feature (label) with 3 levels: decrease ("1"), neutral ("2"), and increase ("3"). It also contains 8 descriptive features: 5 of them (f1, f2, f3, f4, f5) are the minima and maxima that form a double top stock chart pattern, 2 of them (dateF, dateE) are the indexes of the start and end of the pattern, and the last is the industry of the stock in which the pattern occurs.

Before arriving at this dataset, we scraped historical stock data for various industries from https://finance.yahoo.com/ (see the table above) and used the closing price. We then determined the minima and maxima of the time series for every stock, and detected the double top stock chart pattern by setting thresholds on the minima and maxima that form it. Lastly, we determined the corresponding target feature by comparing the highest maximum and the lowest minimum within the pattern with the maximum and minimum prices within the next 10 days after the pattern is observed. The target is labeled increase ("3") if the maximum price within the next 10 days is greater than the highest maximum within the pattern, decrease ("1") if the minimum price within the next 10 days is less than the lowest minimum within the pattern, and neutral ("2") if neither or both happen.
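
The labelling rule can be sketched as a small function. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the names `label_pattern`, `pattern_prices`, and `future_prices` are hypothetical, and the label codes follow the 1/2/3 coding given at the start of this section.

```python
from typing import Sequence

# Label codes as given in the data description (assumed raw coding).
DECREASE, NEUTRAL, INCREASE = 1, 2, 3


def label_pattern(pattern_prices: Sequence[float],
                  future_prices: Sequence[float]) -> int:
    """Label a detected pattern from the prices that follow it.

    pattern_prices: the five extrema (f1..f5) that form the pattern.
    future_prices: closing prices after the pattern (10 days here).
    """
    breaks_up = max(future_prices) > max(pattern_prices)
    breaks_down = min(future_prices) < min(pattern_prices)
    if breaks_up and not breaks_down:
        return INCREASE
    if breaks_down and not breaks_up:
        return DECREASE
    # Neither level is crossed, or both are.
    return NEUTRAL


# Made-up prices: the lowest pattern point is 10.0 and the price later hits 9.8.
pattern = [10.0, 12.0, 11.0, 12.1, 10.5]
print(label_pattern(pattern, [10.4, 10.2, 9.8]))  # 1 (decrease)
```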

## Data Exploration and Preprocessing

Here we perform feature engineering. We created 3 additional descriptive features: the absolute difference between f1 and f3, the absolute difference between f3 and f5, and the sum of the values from f1 to f5. These features are named d1, d2, and sum respectively, and are intended to help increase the predictive accuracy of the model built later on.
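
As a rough sketch of the three features described above, the following computes d1, d2, and sum with pandas. The function name `add_difference_features` and the toy prices are assumptions made for illustration; the notebook's own pipeline derives its features through `add_height_features`.

```python
import pandas as pd


def add_difference_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with d1, d2 and sum columns added."""
    out = df.copy()
    out['d1'] = (out['f1'] - out['f3']).abs()
    out['d2'] = (out['f3'] - out['f5']).abs()
    out['sum'] = out[['f1', 'f2', 'f3', 'f4', 'f5']].sum(axis=1)
    return out


# Toy rows with made-up prices, for illustration only.
toy = pd.DataFrame({'f1': [10.0, 20.0], 'f2': [12.0, 25.0],
                    'f3': [11.0, 22.0], 'f4': [12.5, 24.0],
                    'f5': [10.5, 21.0]})
print(add_difference_features(toy)[['d1', 'd2', 'sum']])
```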

In [6]:
df = import_('trial.csv').drop(['increment', 'ema', 'window'], axis=1)

In [7]:
df_copy = df.copy(deep=True)

#### Label encoding of 'industry'

#### Lump categories

In [9]:
df_copy['industry_lumped'] = lump_categories(df_copy['industry'])

In [10]:
df_copy = df_copy.drop('industry', axis=1)

In [11]:
df_copy['industry_coded'] = encode_label(df_copy['industry_lumped'])

In [12]:
df_copy = df_copy.drop('industry_lumped', axis=1)

In [13]:
data = (df_copy >>
         extract_date >>
         drop(['date']) >>
         drop(['fw_ret_1', 
               'fw_ret_2',
               'fw_ret_3']) >>
         add_height_features)

#### Get the size or width of the whole pattern

In [14]:
data['pattern_width'] = np.abs(data.dateF - data.dateE)

In [15]:
data['w1'] = get_height(data.idx1, data.idx3)
data['w2'] = get_height(data.idx3, data.idx5)
data['w3'] = get_height(data.idx2, data.idx4)

In [16]:
indices = ['idx1', 'idx2', 'idx3', 'idx4', 'idx5']

In [17]:
data = data.drop(indices, axis=1)

#### Data validation

Drop rows with a width of 0, since these indicate an invalid pattern.

In [18]:
data = data[data.w1 != 0]
data = data[data.w2 != 0]
data = data[data.w3 != 0]

#### Drop 'dateF' and 'dateE' after getting pattern width

In [19]:
data = data.drop(['dateF', 'dateE'], axis=1)

In [20]:
Counter(data.label)

Counter({1: 254, 3: 328, 2: 90})
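
The counts above show that the classes are imbalanced. As a rough reference point that is not part of the original analysis, the sketch below computes each class's share of the rows and the accuracy a trivial model would reach by always predicting the most frequent class.

```python
from collections import Counter

# Class counts reported by Counter(data.label) above.
counts = Counter({1: 254, 3: 328, 2: 90})
total = sum(counts.values())

for label, n in sorted(counts.items()):
    print('label {}: {} rows ({:.1%})'.format(label, n, n / total))

# Accuracy of a trivial model that always predicts the most common class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print('majority baseline accuracy: {:.4f}'.format(baseline_accuracy))
```

Any accuracy reported later is easier to judge against this baseline than against 0%.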

In [None]:
undersample = NearMiss(version=3)

In [None]:
def undersample_(X: pd.DataFrame, 
                 y: pd.Series,
                 version: int = 1):
    
    undersample = NearMiss(version=version)
    return undersample.fit_resample(X, y)

In [None]:
X, y = undersample_(data.drop('label', axis=1),
                    data['label'])

In [None]:
Counter(y)

#### Change the coding of the label

In [21]:
X, y = data.drop('label', axis=1), data['label']

In [22]:
mapping = {1: 0, 2: 1, 3: 2}

In [23]:
y = y.replace(mapping)

In [24]:
to_filter = ['f1', 'f2', 'f3', 'f4', 'f5', 
             'industry_coded', 'pattern_width',
             'h1', 'h2', 'h4', 'h6', 'h9',
             'w1', 'w2', 'w3']

In [25]:
X = X[to_filter]

In [None]:
to_corr = ['w1', 'w2', 'w3', 'pattern_width', 'f1',
           'h1', 'h2', 'h4', 'h6', 'h9']

In [None]:
fig = plt.figure(figsize=(12, 10), dpi=100)
sns.heatmap(X[to_corr].corr(method='spearman'), annot=True)
plt.show()

## Model selection

The dataset consists of quantitative features and a single categorical feature, the target, which has multiple levels. We therefore fit several multiclass models to the dataset, in particular multiclass KNN, multinomial logistic regression, and multiclass SVM, among others, to find the best predictive model for this project. The pycaret machine learning library lets us fit and compare these models in one step.
pycaret can also normalize and split the dataset when asked to, one-hot encode nominal features, perform cross-validation, and tune hyperparameters for each model. In our case, we specify min-max scaling as the normalization method, split the data into an 80% training set and a 20% test set, and use 5-fold cross-validation with 10 repetitions.
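
To illustrate what min-max normalization does, here is a minimal sketch written by hand rather than with pycaret. It learns the range from the training values and reuses it for the test values, which is the usual way to avoid leaking test information; whether pycaret applies the steps in exactly this order is not shown in this notebook. The function names and the prices are assumptions for illustration.

```python
from typing import List, Tuple


def fit_minmax(train: List[float]) -> Tuple[float, float]:
    """Learn the minimum and maximum from the training values only."""
    return min(train), max(train)


def apply_minmax(values: List[float], lo: float, hi: float) -> List[float]:
    """Scale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # A constant feature has no range; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


train_prices = [10.0, 12.0, 14.0, 20.0]
test_prices = [11.0, 22.0]

lo, hi = fit_minmax(train_prices)
print(apply_minmax(train_prices, lo, hi))  # [0.0, 0.2, 0.4, 1.0]
print(apply_minmax(test_prices, lo, hi))   # [0.1, 1.2]
```

A test value outside the training range, such as 22.0 here, scales to a value above 1. Separately, 5 folds repeated 10 times means each candidate model is fit and scored 50 times during cross-validation.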

In [26]:
data = pd.concat([X, y], axis=1)

In [27]:
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=13)

In [28]:
exp_name = setup(data=data,
                 target='label',
                 normalize=True,
                 normalize_method='minmax',
                 train_size=0.8,
                 data_split_stratify=True,
                 fold_strategy=cv,
                 remove_outliers=True,
                 use_gpu=True,
                 session_id=13,
                 pca=True,
                 pca_method='linear')

| | Description | Value |
|---|---|---|
| 0 | session_id | 13 |
| 1 | Target | label |
| 2 | Target Type | Multiclass |
| 3 | Label Encoded | 0: 0, 1: 1, 2: 2 |
| 4 | Original Data | (672, 16) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 15 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


In [29]:
best_model = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| ridge | Ridge Classifier | 0.5067 | 0.0 | 0.3595 | 0.4848 | 0.3922 | 0.0452 | 0.0831 | 0.0108 |
| lr | Logistic Regression | 0.5049 | 0.5615 | 0.3537 | 0.4804 | 0.3737 | 0.0319 | 0.0743 | 0.0328 |
| lda | Linear Discriminant Analysis | 0.489 | 0.5664 | 0.3732 | 0.4512 | 0.3961 | 0.0504 | 0.0734 | 0.012 |
| qda | Quadratic Discriminant Analysis | 0.4794 | 0.5855 | 0.385 | 0.4527 | 0.425 | 0.0623 | 0.0725 | 0.0116 |
| nb | Naive Bayes | 0.4727 | 0.5812 | 0.3801 | 0.4421 | 0.4203 | 0.0559 | 0.0639 | 0.0186 |
| rf | Random Forest Classifier | 0.4669 | 0.5469 | 0.3762 | 0.4625 | 0.4476 | 0.0597 | 0.0618 | 0.8892 |
| ada | Ada Boost Classifier | 0.4631 | 0.5368 | 0.3785 | 0.4511 | 0.4436 | 0.0599 | 0.0618 | 0.0984 |
| et | Extra Trees Classifier | 0.462 | 0.5434 | 0.38 | 0.4559 | 0.4465 | 0.0591 | 0.0605 | 0.6998 |
| svm | SVM - Linear Kernel | 0.4582 | 0.0 | 0.3713 | 0.3863 | 0.3583 | 0.0419 | 0.0571 | 0.1156 |
| lightgbm | Light Gradient Boosting Machine | 0.4524 | 0.5417 | 0.3719 | 0.448 | 0.442 | 0.0505 | 0.0513 | 0.1344 |


In [None]:
get_config('X')

Since this project aims to predict stock prices from the double top stock chart pattern for trading purposes, both false negatives and false positives matter. In the case of a false negative, we predict a decrease in price after the pattern and decide to sell the stock because we expect no further profit; if the price actually increases, we lose the opportunity to earn more.

In the case of a false positive, we predict an increase in price after the pattern and decide to buy the stock so that we can sell it during the increase; if the price actually decreases, we lose money. In addition, the target feature has imbalanced classes. We therefore emphasize the F1 score over accuracy and other performance metrics when measuring the predictive power of the model.

The output above lists the predictive performance of several models after fitting them to our data. The model that produces the highest F1 score is the gradient boosting classifier, a sequential ensemble approach, with a score of 0.5217. This F1 score, the harmonic mean of precision and recall, summarizes how well the model predicts the target feature given f1, f2, f3, f4, f5, dateF, dateE, and industry.

We also interpret the other performance metrics below, but only to give a fuller picture of the model's performance; the F1 score remains the primary metric for this project.

The model that produces the highest accuracy is again the gradient boosting classifier, with a value of 0.5616, followed by the Ada Boost classifier with a value of 0.5579; both are sequential ensemble approaches. This means that these models predict the target feature given f1, f2, f3, f4, f5, dateF, dateE, and industry with 56.16% and 55.79% accuracy respectively.

The Ada Boost classifier also produces the highest precision, with a value of 0.5210. This suggests that about 52% of the predictions the model makes for a class are correct. The gradient boosting classifier produces the highest AUC and recall (sensitivity), with values of 0.6750 and 0.4883 respectively. An AUC of 0.6750 indicates a moderate ability to separate the classes, where 0.5 corresponds to random guessing and 1.0 to perfect separation. The recall value implies that, of the instances that actually belong to a class, the model correctly identifies 48.83%.
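
The difference between accuracy and F1 on imbalanced classes can be illustrated with a small hand-computed example. This is a sketch on made-up labels, using macro averaging (the unweighted mean of per-class F1) as an assumption; the notebook does not state which averaging pycaret uses for its multiclass F1. The function names are introduced here for illustration.

```python
from typing import List


def f1_per_class(y_true: List[int], y_pred: List[int], label: int) -> float:
    """F1 for one class, treating it as positive and the rest as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced labels (0 = decrease, 1 = neutral, 2 = increase).
y_true = [2] * 6 + [0] * 3 + [1] * 1
# A model that always predicts the majority class.
y_pred = [2] * 10

print('accuracy = {:.2f}'.format(accuracy(y_true, y_pred)))  # accuracy = 0.60
print('macro F1 = {:.2f}'.format(macro_f1(y_true, y_pred)))  # macro F1 = 0.25
```

The always-majority model looks acceptable on accuracy but scores poorly on macro F1, because it never identifies the two smaller classes.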

## Tuned hyperparameters

In [None]:
best_model

In [None]:
plot_model(best_model, plot='confusion_matrix')

In [None]:
plot_model(best_model)

In [None]:
plot_model(best_model, plot='class_report')

In [None]:
plot_model(best_model, plot='pr')

In [None]:
plot_model(best_model, plot='boundary')

In [None]:
plot_model(best_model, plot='learning')

## Evaluate predictive accuracy of the model built

In [None]:
test_scores = predict_model(best_model)

In [None]:
test_scores

In [None]:
test_scores.Score.mean()

When applied to the test set, the model gives a predictive accuracy of 62.40%. This suggests that we predicted the movement of stock prices for new queries of the double top stock chart pattern (the test set) with 62.40% accuracy.

## Conclusion and Recommendation

The dataset we created consists of a target feature with 3 levels (decrease/1, neutral/2, increase/3) and 7 descriptive features, of which 5 (f1, f2, f3, f4, f5) form a double top stock chart pattern and the other two are the indexes of the start and end of the pattern. When we first fit several multiclass models to this dataset, their accuracy was below 50%. We therefore performed feature engineering by taking the absolute difference between f1 and f3, the absolute difference between f3 and f5, and the average of f1 to f5. These 3 additional descriptive features helped raise the accuracy of the predictive model from below 50% to slightly above 50%. The model that produces this result is the gradient boosting classifier, a sequential ensemble approach, with an F1 score of 0.5217. This score, the harmonic mean of precision and recall, summarizes how well the model predicts the movement of stock prices given a double top stock chart pattern. Furthermore, when model performance was evaluated on the test set, it produced a predictive accuracy of 62.40%.

## References

* https://medium.com/analytics-vidhya/accuracy-vs-f1-score-6258237beca2
* https://pycaret.org/classification/
* https://finance.yahoo.com/