# Notebook Instructions

1. All the <u>code and data files</u> used in this course are available in the downloadable unit of the <u>last section of this course</u>.
2. You can run the notebook document sequentially (one cell at a time) by pressing **shift + enter**. 
3. While a cell is running, a [*] is shown on the left. After the cell is run, the output will appear on the next line.

This course is based on specific versions of Python packages. You can find the details of the packages in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>.

## Class Weights in Decision Trees

When we build a decision tree model, the dataset may contain very few data points for its most important classes. In such a case, the decision tree algorithm will try to maximize the accuracy on the most common labels. 

To adjust for this issue, we re-assign weights to the data points of the most important labels. This can be done in the scikit-learn library using the class_weight argument of the decision tree classifier. Let us take an example to illustrate this.
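
Before turning to the price data, the effect of class_weight can be sketched on a deliberately artificial dataset. The snippet below is an illustration only: the names `X_demo`, `y_demo`, `plain` and `weighted` and the weight values are assumptions, not part of the course material. The single feature is constant, so the tree cannot split and consists of one leaf, which makes it easy to see that the weights alone decide which label that leaf predicts.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one constant feature, so no split is possible
X_demo = np.zeros((10, 1))
y_demo = np.array([0] * 8 + [1] * 2)  # 8 samples of label 0, 2 of label 1

# Without class weights the single leaf predicts the most frequent label
plain = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# With label 1 weighted 5x, its weighted count (2 * 5 = 10) exceeds
# that of label 0 (8 * 1 = 8), so the same leaf now predicts label 1
weighted = DecisionTreeClassifier(class_weight={0: 1, 1: 5},
                                  random_state=0).fit(X_demo, y_demo)

print(plain.predict([[0.0]]))
print(weighted.predict([[0.0]]))
```

In a real tree the same weighted counts also enter the impurity calculation, so the weights influence where the splits are placed, not just the label of each leaf.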

#### Example:

We will input raw hourly price data from a csv file (`2ySOLdata1h.csv`). The data consists of Open-High-Low-Close prices and Volume data. Predictor and target variables are created using this raw data. 

In [13]:
import pandas as pd
df = pd.read_csv('../2ySOLdata1h.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df.set_index('timestamp', inplace=True)
df = df.rename(columns={"Open": "OPEN", "High": "HIGH", "Low": "LOW", "Close": "CLOSE"})
df = df.iloc[:, :-1]
df.tail()

| timestamp | OPEN | HIGH | LOW | CLOSE | Volume |
|---|---|---|---|---|---|
| 2023-12-31 02:00:00 | 102.718 | 103.063 | 101.208 | 101.411 | 731765 |
| 2023-12-31 03:00:00 | 101.413 | 101.85 | 100.044 | 100.738 | 970135 |
| 2023-12-31 04:00:00 | 100.734 | 100.939 | 99.635 | 100.743 | 858035 |
| 2023-12-31 05:00:00 | 100.734 | 102.533 | 100.532 | 101.974 | 879783 |
| 2023-12-31 06:00:00 | 101.982 | 102.576 | 101.27 | 101.366 | 619216 |


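#### Computing Technical Indicators and Future Returns

We compute the values for the Average Directional Index (ADX), Relative Strength Index (RSI), and Simple Moving Average (SMA) using the TA-Lib package. These will be used as predictor variables in the decision tree model. Next, we compute the future returns on the close price, that is, the percentage change from the current bar's close to the next bar's close. Since the data is hourly, this is a one-hour-ahead return. Rows with missing values (the indicator warm-up period and the last row, which has no future return) are then dropped. The code is shown below.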


In [14]:
import numpy as np
import talib as ta

# Import and filter warnings
import warnings
warnings.filterwarnings("ignore")

df['ADX'] = ta.ADX(df['HIGH'].values, df['LOW'].values,
                   df['CLOSE'].values, timeperiod=14)
df['RSI'] = ta.RSI(df['CLOSE'].values, timeperiod=14)
df['SMA'] = ta.SMA(df['CLOSE'].values, timeperiod=20)

df['Return'] = df['CLOSE'].pct_change(1).shift(-1)
df = df.dropna()

df.tail(15)

| timestamp | OPEN | HIGH | LOW | CLOSE | Volume | ADX | RSI | SMA | Return |
|---|---|---|---|---|---|---|---|---|---|
| 2023-12-30 15:00:00 | 102.969 | 104.33 | 102.953 | 103.992 | 1020147 | 17.086782 | 49.50197 | 104.107 | 0.003289 |
| 2023-12-30 16:00:00 | 103.991 | 104.5 | 102.563 | 104.334 | 1444959 | 16.585758 | 51.026384 | 104.02805 | -0.008281 |
| 2023-12-30 17:00:00 | 104.326 | 105.0 | 102.89 | 103.47 | 1333327 | 15.718191 | 47.153663 | 103.9553 | 0.002068 |
| 2023-12-30 18:00:00 | 103.469 | 103.961 | 102.654 | 103.684 | 855827 | 15.085637 | 48.20228 | 103.91995 | -0.003704 |
| 2023-12-30 19:00:00 | 103.684 | 103.814 | 102.85 | 103.3 | 535552 | 14.498265 | 46.422238 | 103.8302 | 0.00062 |
| 2023-12-30 20:00:00 | 103.301 | 104.2 | 102.9 | 103.364 | 594480 | 13.60262 | 46.775024 | 103.68715 | 0.000619 |
| 2023-12-30 21:00:00 | 103.36 | 104.402 | 103.325 | 103.428 | 707664 | 12.674634 | 47.149788 | 103.56135 | -0.008392 |
| 2023-12-30 22:00:00 | 103.418 | 103.81 | 101.82 | 102.56 | 737844 | 12.934622 | 42.753015 | 103.36635 | -0.006523 |
| 2023-12-30 23:00:00 | 102.559 | 102.679 | 101.5 | 101.891 | 591313 | 13.397754 | 39.681626 | 103.2156 | 0.000785 |
| 2023-12-31 00:00:00 | 101.891 | 102.436 | 101.268 | 101.971 | 565171 | 13.989803 | 40.234552 | 103.132 | 0.007316 |
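
To make the return calculation concrete, here is a small sketch on made-up close prices. The series `close` and the 3-bar window are assumptions chosen for readability; the notebook itself uses TA-Lib with a 20-bar SMA. It shows that `pct_change(1).shift(-1)` aligns each row with the return realised over the *next* bar, and that a simple moving average is just a rolling mean.

```python
import pandas as pd

# Hypothetical close prices, one per bar
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0], name="CLOSE")

# Simple moving average over 3 bars (first two values are NaN: warm-up period)
sma3 = close.rolling(window=3).mean()

# pct_change(1) is the return from the previous bar to this bar;
# shift(-1) moves it up one row, so each row holds the next bar's return
future_return = close.pct_change(1).shift(-1)

demo = pd.DataFrame({"CLOSE": close, "SMA3": sma3, "Return": future_return})
print(demo)

# dropna() removes the warm-up rows and the last row (no future return)
print(demo.dropna())
```

Because the target looks one bar ahead while the indicators only use data up to the current bar, the predictors do not leak future information into the model.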


#### Categorize Returns into Multiple Classes

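We define a function called 'returns_to_class' that uses an if...elif...else chain to categorize returns into multiple classes. The range of returns for each class is specified in this function: class 0 for returns at or below zero, class 1 for returns between 0 and 2%, class 2 for returns between 2% and 3%, and class 3 for larger returns. The function is then applied to each row of our dataframe, df, to get the multi-class target variable.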


In [15]:
def returns_to_class(row):
    # Map one row's future return to a class label
    if row.Return <= 0.0:
        return 0
    elif row.Return < 0.02:
        return 1
    elif row.Return < 0.03:
        # A return of exactly 0.02 lands here instead of falling through to class 3
        return 2
    else:
        return 3


df['Class'] = df.apply(returns_to_class, axis=1)
df.tail(15)

| timestamp | OPEN | HIGH | LOW | CLOSE | Volume | ADX | RSI | SMA | Return | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-12-30 15:00:00 | 102.969 | 104.33 | 102.953 | 103.992 | 1020147 | 17.086782 | 49.50197 | 104.107 | 0.003289 | 1 |
| 2023-12-30 16:00:00 | 103.991 | 104.5 | 102.563 | 104.334 | 1444959 | 16.585758 | 51.026384 | 104.02805 | -0.008281 | 0 |
| 2023-12-30 17:00:00 | 104.326 | 105.0 | 102.89 | 103.47 | 1333327 | 15.718191 | 47.153663 | 103.9553 | 0.002068 | 1 |
| 2023-12-30 18:00:00 | 103.469 | 103.961 | 102.654 | 103.684 | 855827 | 15.085637 | 48.20228 | 103.91995 | -0.003704 | 0 |
| 2023-12-30 19:00:00 | 103.684 | 103.814 | 102.85 | 103.3 | 535552 | 14.498265 | 46.422238 | 103.8302 | 0.00062 | 1 |
| 2023-12-30 20:00:00 | 103.301 | 104.2 | 102.9 | 103.364 | 594480 | 13.60262 | 46.775024 | 103.68715 | 0.000619 | 1 |
| 2023-12-30 21:00:00 | 103.36 | 104.402 | 103.325 | 103.428 | 707664 | 12.674634 | 47.149788 | 103.56135 | -0.008392 | 0 |
| 2023-12-30 22:00:00 | 103.418 | 103.81 | 101.82 | 102.56 | 737844 | 12.934622 | 42.753015 | 103.36635 | -0.006523 | 0 |
| 2023-12-30 23:00:00 | 102.559 | 102.679 | 101.5 | 101.891 | 591313 | 13.397754 | 39.681626 | 103.2156 | 0.000785 | 1 |
| 2023-12-31 00:00:00 | 101.891 | 102.436 | 101.268 | 101.971 | 565171 | 13.989803 | 40.234552 | 103.132 | 0.007316 | 1 |
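
Applying a Python function row by row can be slow on large dataframes. As a possible alternative, the same kind of binning can be sketched with `numpy.select`, which evaluates the conditions on the whole column at once. The sample `returns` series below is made up, and the sketch assumes that a return of exactly 0.02 belongs to class 2 and a return of exactly 0.03 to class 3; adjust the comparisons if you want the boundaries handled differently.

```python
import numpy as np
import pandas as pd

# Hypothetical future returns, including the boundary values 0.0, 0.02 and 0.03
returns = pd.Series([-0.01, 0.0, 0.005, 0.02, 0.025, 0.03, 0.05])

# Conditions are checked in order; the first one that is True wins
conditions = [
    returns <= 0.0,   # class 0: non-positive return
    returns < 0.02,   # class 1: (0, 0.02)
    returns < 0.03,   # class 2: [0.02, 0.03)
]
labels = np.select(conditions, [0, 1, 2], default=3)  # class 3: everything else

print(labels.tolist())
```

Note that a missing return would match none of the conditions and end up in class 3, so missing values should be dropped first, as the notebook does with `dropna()`.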


#### View the Multi-Class Distribution

Once we have defined the different classes for the target variable, we can view their distribution using the groupby method. As the output shows, most data points belong to class '0' (8,827 data points, which signifies non-positive returns) and class '1' (8,027 data points). On the other hand, only 360 and 257 data points belong to classes '2' and '3' respectively.

In [16]:
df.groupby('Class').count()

| Class | OPEN | HIGH | LOW | CLOSE | Volume | ADX | RSI | SMA | Return |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8827 | 8827 | 8827 | 8827 | 8827 | 8827 | 8827 | 8827 | 8827 |
| 1 | 8027 | 8027 | 8027 | 8027 | 8027 | 8027 | 8027 | 8027 | 8027 |
| 2 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 |
| 3 | 257 | 257 | 257 | 257 | 257 | 257 | 257 | 257 | 257 |
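
The imbalance is easier to judge as a share of the total. The sketch below rebuilds a label column from the four counts in the output above (the names `counts` and `class_demo` are illustrative, and the real `df` is not needed to run it) and uses `value_counts`, which is a more compact way to get the distribution of a single column.

```python
import numpy as np
import pandas as pd

# Class counts taken from the groupby output above
counts = {0: 8827, 1: 8027, 2: 360, 3: 257}

# Rebuild a label column with those counts (stand-in for df['Class'])
class_demo = pd.Series(np.repeat(list(counts), list(counts.values())), name="Class")

# Absolute counts and relative shares per class
distribution = class_demo.value_counts().sort_index()
share = class_demo.value_counts(normalize=True).sort_index()

print(distribution)
print(share.round(4))
```

Classes '2' and '3' together make up less than 4% of the data, which is why an unweighted tree can ignore them at little cost to overall accuracy.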


#### Create Predictor Variables and Target Variable

Let us now define our predictor variables, X, and the target variable, y, for building a decision tree model.

In [17]:
X = df[['ADX', 'RSI', 'SMA']]
y = df.Class

  
We will consider two scenarios:   

1) Building a decision tree model without applying the class weights and    
2) Building a decision tree model with class weights.


### Scenario 1 - Build a decision tree model without applying Class Weights

In [18]:
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

# Split into Train and Test datasets
split_percentage = 0.8
split = int(split_percentage*len(X))
# Train data set
X_train = X[:split]
y_train = y[:split]
# Test data set
X_test = X[split:]
y_test = y[split:]

#print (X_train.shape, y_train.shape)
#print (X_test.shape, y_test.shape)

# Fit a model on train data
clf = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the model performance
report = classification_report(y_test, y_pred)
print(report)

# A warning can occur here because some labels in y_test don't appear in y_pred. In this case, there are very few
# observations for label 2 and label 3, so the model may never predict them.

              precision    recall  f1-score   support

           0       0.50      0.84      0.62      1724
           1       0.51      0.18      0.27      1652
           2       0.00      0.00      0.00        71
           3       0.00      0.00      0.00        48

    accuracy                           0.50      3495
   macro avg       0.25      0.25      0.22      3495
weighted avg       0.48      0.50      0.43      3495



As can be seen from the output of the classification report, the decision tree algorithm tries to maximize the accuracy of the most common labels and does not give good predictions on the underrepresented labels: precision and recall are both 0.00 for labels '2' and '3'.
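
When a label is never predicted, its precision is undefined and scikit-learn emits a warning. One way to handle this explicitly, rather than silencing all warnings, is the `zero_division` argument of `classification_report`. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) and also prints a confusion matrix, which shows where the samples of the ignored class end up.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: class 2 exists in the truth but is never predicted
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 0, 0]

# zero_division=0 reports undefined precision as 0.0 without a warning
report_demo = classification_report(y_true_demo, y_pred_demo,
                                    zero_division=0, output_dict=True)
print(report_demo["2"])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
print(cm)
```

An all-zero column in the confusion matrix is the signature of a class the model never predicts.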

### Scenario 2 - Build a decision tree model with Class Weights 

Let us use the class_weight parameter when defining the decision tree classifier to correct for the underrepresented labels.

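We can assign class_weight = 'balanced'. This weights each class inversely proportional to its frequency in the training data, so that the total weight of every class is the same.

As the classification report below shows, using class weights raises the recall on the underrepresented labels, which were labels '2' and '3' in this case (from 0.00 to 0.80 and 0.15 respectively). This comes at a cost: precision on those labels stays very low, label '0' is no longer predicted correctly, and the overall accuracy drops from 0.50 to 0.24. After class weights are changed, a model which otherwise shows good performance can therefore appear to be less effective. Note also that this cell uses criterion='gini' and max_depth=3, whereas Scenario 1 used criterion='entropy' and max_depth=5, so the two reports differ in more than the class weights.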
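
To see what 'balanced' means numerically, the weights can be computed with scikit-learn's `compute_class_weight` helper and checked against the formula `n_samples / (n_classes * count_per_class)`. The training counts below are not printed anywhere in this notebook; they are derived here by subtracting the test-set support in the Scenario 1 report from the full class counts, so treat them as an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed training counts: full counts minus test support
# (8827-1724, 8027-1652, 360-71, 257-48)
train_counts = {0: 7103, 1: 6375, 2: 289, 3: 209}
y_train_demo = np.repeat(list(train_counts), list(train_counts.values()))

classes = np.unique(y_train_demo)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train_demo)

# The same weights from the formula n_samples / (n_classes * count)
manual = len(y_train_demo) / (len(classes) * np.bincount(y_train_demo))

for label, weight in zip(classes, weights):
    print(label, round(float(weight), 2))
```

Under these assumed counts, each sample of label '3' weighs roughly 34 times as much as a sample of label '0', which helps explain why the weighted tree shifts its predictions so strongly toward the rare labels.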

In [19]:
# Split into Train and Test datasets
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

split_percentage = 0.8
split = int(split_percentage*len(X))
# Train data set
X_train = X[:split]
y_train = y[:split]
# Test data set
X_test = X[split:]
y_test = y[split:]

# Fit a model on train data
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=5,
                             class_weight='balanced')
clf = clf.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the model performance
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1724
           1       0.49      0.48      0.48      1652
           2       0.04      0.80      0.07        71
           3       0.02      0.15      0.04        48

    accuracy                           0.24      3495
   macro avg       0.14      0.36      0.15      3495
weighted avg       0.23      0.24      0.23      3495



You can try this model yourself on a new dataset to see how it works. In the next unit, there will be an interactive exercise. All the best!