<div class="alert alert-block alert-info">
<b><h2><center>Data Science Project - Buy/Hold/Sell prediction</center></h2></b>


Step by step explanation</div>

### Detailed explanation

The idea of this project is to determine whether we should buy, sell or hold a specific stock. To do that, we find the local minimum and local maximum points of the price series (buy and sell points, respectively), label every other data point as "hold", and then train a machine learning classifier on those labels.

We will use the 10 Brazilian stocks with the highest traded volume and train three classifiers: Logistic Regression, Random Forest and (nonlinear) SVC. Their task is to classify whether the price on a specific day is a buy opportunity, a hold situation, or a sell spot.

This notebook demonstrates in more detail the entire process of the main algorithm, which is implemented as classes in the *simulation.py* file, with the *fit_models.py* and *obj_functions_classifiers.py* files as complementary material. Here, we will use PETR4 and LogisticRegression as an example and execute every function step by step.

Summary:
>[Data preparation](#datapreparation)   
>[Train time](#modeltraining)  
>[Profitability](#profitability) 

In [31]:
import yfinance as yf
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

import plotly.graph_objects as go
import plotly.express as px

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

from sklearn.metrics import confusion_matrix

from sklearn.preprocessing import StandardScaler, MinMaxScaler

import optuna

import warnings
warnings.filterwarnings('ignore')

optuna.logging.set_verbosity(optuna.logging.WARNING)

As a first step, let's look at the general structure of our dataset. For every stock, we have the opening, closing, maximum and minimum price for each day, as well as the traded volume, dividends paid and stock splits.

In [32]:
stock = yf.Ticker('PETR4.SA')
df = stock.history(period='max')
df.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2000-01-03,2.139587,2.139587,2.139587,2.139587,35389440000,0.0,0.0
2000-01-04,2.021228,2.021228,2.021228,2.021228,28861440000,0.0,0.0
2000-01-05,2.000833,2.000833,2.000833,2.000833,43033600000,0.0,0.0
2000-01-06,1.993914,1.993914,1.993914,1.993914,34055680000,0.0,0.0
2000-01-07,2.003018,2.003018,2.003018,2.003018,20912640000,0.0,0.0


<a id="datapreparation"></a>

<div class="alert alert-block alert-info">
<b><h2><center>1. Data preparation</center></h2></b>


</div>

The idea behind this section is to enhance our dataset, adding new features to it so we can make our Machine Learning algorithm more powerful.

### 1.1 - Splitting

We start by splitting our dataset into train (70%) and test (30%) sets before any transformation, to avoid modifications that may lead to data leakage. The train set contains the older data about the stock, while the test set is more recent.

From now on, I will use the train set for all explanations and charts.

In [33]:
train_df = df[0:round(0.7*len(df))].copy()
test_df = df[len(train_df):].copy()
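
As a quick illustration of the chronological split, here is a minimal sketch on a made-up 10-day series. The names `toy_df`, `toy_train` and `toy_test` are hypothetical and exist only for this example; `.iloc` is used for explicit positional slicing.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for the real price history (made-up values)
toy_df = pd.DataFrame(
    {"Close": np.arange(10, dtype=float)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

# Same rule as above: the first 70% of the rows for training, the rest for testing
split = round(0.7 * len(toy_df))
toy_train = toy_df.iloc[:split].copy()
toy_test = toy_df.iloc[split:].copy()

print(len(toy_train), len(toy_test))
# Every training date comes before every test date
print(toy_train.index.max() < toy_test.index.min())
```

Because the rows are sliced by position rather than sampled at random, the model never sees prices from the test period during training.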

### 1.2 - Normalized price

We can create a column that combines the closing, maximum and minimum prices of each day and maps them onto the [0, 1] interval by defining a normalized price as follows:

$\text{Norm. Price} = \frac{\text{Close} - \text{Low}}{\text{High} - \text{Low}}$

If the normalized price is close to 0, the price ended the day close to its minimum. A value close to 1 means it ended near the maximum. This feature summarizes where the close sits within the day's range, which is more informative than using a single price (such as the close), and it is not sensitive to stock splits. When High equals Low the ratio is undefined, and the code below returns 1.0 in that case.
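
To make the formula concrete, here is a small vectorized sketch on made-up prices. The `toy` frame and the variable names are hypothetical; the 1.0 fallback for a zero range mirrors the convention used by the notebook's `normalized_price` function.

```python
import numpy as np
import pandas as pd

# Made-up prices: close at the low, in the middle, at the high, and a flat day
toy = pd.DataFrame({
    "High":  [10.0, 10.0, 10.0, 5.0],
    "Low":   [8.0, 8.0, 8.0, 5.0],
    "Close": [8.0, 9.0, 10.0, 5.0],
})

day_range = toy["High"] - toy["Low"]
# Replace a zero range with NaN so the division never hits 0/0,
# then fall back to 1.0 on those days
ratio = (toy["Close"] - toy["Low"]) / day_range.replace(0, np.nan)
norm_price = np.where(day_range == 0, 1.0, ratio)

print(norm_price)
```

The four days come out as 0.0, 0.5, 1.0 and 1.0 (the last one being the flat-day fallback).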

In [34]:
def normalized_price(row):
    
    cl = row['Close'] - row['Low']
    hl = row['High'] - row['Low']
    
    if hl == 0:
        return 1.0
    else:
        return cl/hl
    
train_df['normalized_price'] = train_df.apply(normalized_price, axis=1)
test_df['normalized_price'] = test_df.apply(normalized_price, axis=1)

train_df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,normalized_price
2000-01-03,2.139587,2.139587,2.139587,2.139587,35389440000,0.0,0.0,1.000000
2000-01-04,2.021228,2.021228,2.021228,2.021228,28861440000,0.0,0.0,1.000000
2000-01-05,2.000833,2.000833,2.000833,2.000833,43033600000,0.0,0.0,1.000000
2000-01-06,1.993914,1.993914,1.993914,1.993914,34055680000,0.0,0.0,1.000000
2000-01-07,2.003018,2.003018,2.003018,2.003018,20912640000,0.0,0.0,1.000000
...,...,...,...,...,...,...,...,...
2015-11-13,3.777290,3.807508,3.641307,3.661453,67000600,0.0,0.0,0.121212
2015-11-16,3.681598,3.878017,3.681598,3.878017,56587800,0.0,0.0,1.000000
2015-11-17,3.928381,3.958599,3.847799,3.908235,52072800,0.0,0.0,0.545456
2015-11-18,3.933418,4.044219,3.923345,3.938455,38137700,0.0,0.0,0.125001


### 1.3 - Local min and max

Using the SciPy library, we can look for local minima and maxima within a specified window. Before that, let's visualize our "Close" prices in a line chart.

In [35]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=train_df.index, y=train_df['Close'], mode='lines', showlegend=False))
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Close')
fig.update_layout(title='Close price time series')
fig

Then we look for those interesting points and add the "target" column to our DataFrame.

In [36]:
def local_max_and_min(df):

    x = np.array(df['Close'])

    # local maxima: closes greater than the 15 closes on each side
    # (mode='wrap' treats the series as circular at its ends)
    lmax = argrelextrema(x, np.greater, order=15, mode='wrap')

    # local minima: closes lower than the 15 closes on each side
    lmin = argrelextrema(x, np.less, order=15, mode='wrap')

    # 2 = hold by default, 1 = local maximum (sell), 0 = local minimum (buy)
    labels = np.full(len(x), 2.0)
    labels[lmax[0]] = 1
    labels[lmin[0]] = 0

    return labels

train_df['target'] = local_max_and_min(train_df)
test_df['target'] = local_max_and_min(test_df)
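
To see what `argrelextrema` returns, here is a toy sketch with a made-up series and `order=1` (each point is compared with one neighbour on each side). The array `toy_close` is hypothetical. The last call shows how `mode='wrap'` changes the treatment of the end points compared with the default `mode='clip'`.

```python
import numpy as np
from scipy.signal import argrelextrema

toy_close = np.array([1.0, 3.0, 2.0, 0.0, 4.0, 5.0, 3.0])

# Indices of strict local maxima and minima (default mode='clip')
lmax = argrelextrema(toy_close, np.greater, order=1)[0]
lmin = argrelextrema(toy_close, np.less, order=1)[0]

# With mode='wrap' the first point is also compared with the last one,
# so index 0 (value 1.0, between 3.0 and 3.0) counts as a minimum too
lmin_wrap = argrelextrema(toy_close, np.less, order=1, mode='wrap')[0]

# Build the 0 (buy) / 1 (sell) / 2 (hold) labels from the clip-mode extrema
labels = np.full(len(toy_close), 2)
labels[lmax] = 1
labels[lmin] = 0

print(lmax, lmin, lmin_wrap)
print(labels)
```

A larger `order` demands that a point beat more neighbours, which yields fewer and more significant extrema.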

After that, we can plot another line chart containing the new information. The green dots are local minima and indicate buying spots, while the red ones are local maxima and represent selling spots.

In [37]:
for c in range(len(train_df)):
    target_value = train_df.iloc[c]['target']
    if target_value == 0:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='Green'), showlegend=False))
    if target_value == 1:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='Red'), showlegend=False))
    else:
        continue

fig

### 1.4 - New max and min points 

However, we may end up labeling two consecutive local maxima, that is, with no local minimum between them (or vice versa for minima), as happens between September 5, 2003 and October 13, 2003. We therefore look for such cases and add a new point (a minimum between two maxima, or a maximum between two minima) between them.

In [38]:
def find_new_min(df):
    max_founds = 0
    old = 0
    idx_old = 0
    write_index = True
    idx = 0

    new_points = []
    for c in df['target']:
        if c == 1:
            max_founds += 1
            if write_index == True:
                idx_old = idx
                write_index = False
        if c == 0:
            max_founds = 0
            write_index=True
        if max_founds == 2:
            #print(idx_old, idx, 'duplicated')


            x = np.array(df['Close'][idx_old:idx+1])
            minp = argrelextrema(x, np.less_equal, order=idx-idx_old)
            new_points.append(idx_old + minp[0][0])


            max_founds = 0
            write_index = True
        idx += 1
        
        
    labels = np.zeros(len(df['Close']))

    idx = 0
    for c in range(0, len(df['Close'])):

        if idx in new_points:
            labels[idx] = 0
        else:
            labels[idx] = df['target'].iloc[idx]
        idx += 1

    return labels

train_df['target'] = find_new_min(train_df)
test_df['target'] = find_new_min(test_df)

In [39]:
for c in range(len(train_df)):
    target_value = train_df.iloc[c]['target']
    if target_value == 0:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='Green'), showlegend=False))
    if target_value == 1:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='Red'), showlegend=False))
    else:
        continue

fig

If you check the same dates mentioned above, you will see that a new local minimum has been added between the two maxima. The same is done for consecutive minima.

In [40]:
def find_new_max(df):

    min_founds = 0
    old = 0
    idx_old = 0
    write_index = True
    idx = 0

    new_points = []
    for c in df['target']:
        if c == 0:
            min_founds += 1
            if write_index == True:
                idx_old = idx
                write_index = False
        if c == 1:
            min_founds = 0
            write_index=True
        if min_founds == 2:
            #print(idx_old, idx, 'duplicated')


            x = np.array(df['Close'][idx_old:idx+1])
            maxp = argrelextrema(x, np.greater_equal, order=idx-idx_old)
            new_points.append(idx_old + maxp[0][0])


            min_founds = 0
            write_index = True
        idx += 1

        
    labels = np.zeros(len(df['Close']))

    idx = 0
    for c in range(0, len(df['Close'])):

        if idx in new_points:
            labels[idx] = 1
        else:
            labels[idx] = df['target'].iloc[idx]
        idx += 1

    return labels

train_df['target'] = find_new_max(train_df)
test_df['target'] = find_new_max(test_df)
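
The core step of both functions, picking the lowest (or highest) close between two extrema of the same type, can be illustrated on a toy array. This sketch uses `np.argmin` as a simplified stand-in for the `argrelextrema(..., np.less_equal, ...)` call; on a segment like this one it should pick the same point. The values and index names are made up.

```python
import numpy as np

toy_close = np.array([5.0, 9.0, 7.0, 4.0, 6.0, 10.0, 8.0])

# Suppose positions 1 and 5 were both labeled as local maxima (sell),
# with no minimum labeled between them
first_max, second_max = 1, 5

# Look only at the closes between the two maxima (inclusive)
segment = toy_close[first_max:second_max + 1]

# Position of the lowest close, translated back to the full series
new_min = first_max + int(np.argmin(segment))

print(new_min, toy_close[new_min])
```

Here the new buy point lands on position 3 (close 4.0), so the sequence becomes sell, buy, sell.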

### 1.5 - Rolling mean and angular coefficients

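We can add more features to our dataset by calculating the rolling mean of the closing price for three different periods (5, 10 and 21 days) and the angular coefficient (slope) of each rolling mean. The slope is taken from a line through the rolling mean on a given day and its value `period` days later (or the last available value near the end of the series). Notice that we lose some data points because the rolling mean is NaN for the first days, until the window is full.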

In [41]:
def get_angular_coef(df, period):
    coef = []
    for c in range(0, len(df)):
        if c+period >= len(df):
            a, b = np.polyfit(x=[0, 1], y=[df[f'rolling_mean_{period}'].iloc[c], df[f'rolling_mean_{period}'].iloc[len(df)-1]], deg=1)
            coef.append(a)
        else:
            a, b = np.polyfit(x=[0, 1], y=[df[f'rolling_mean_{period}'].iloc[c], df[f'rolling_mean_{period}'].iloc[c+period]], deg=1)
            coef.append(a)

    return coef
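
Since the fit in `get_angular_coef` uses only two points placed at x = 0 and x = 1, the slope it returns is simply the difference between the two rolling-mean values. A minimal check with made-up numbers:

```python
import numpy as np

# Made-up rolling-mean values at day c and at day c + period
y0, y1 = 1.88, 1.96

# Degree-1 fit through (0, y0) and (1, y1), as in get_angular_coef
slope, intercept = np.polyfit(x=[0, 1], y=[y0, y1], deg=1)

print(slope, y1 - y0)
```

A positive value therefore means the rolling mean is higher at the later point than at the earlier one.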

In [42]:
# Train df

train_df['rolling_mean_5'] = train_df['Close'].rolling(5, closed='both').mean()
train_df['rolling_mean_10'] = train_df['Close'].rolling(10, closed='both').mean()
train_df['rolling_mean_21'] = train_df['Close'].rolling(21, closed='both').mean()
train_df.dropna(inplace=True)

train_df['angular_coef_5'] = get_angular_coef(train_df, 5)
train_df['angular_coef_10'] = get_angular_coef(train_df, 10)
train_df['angular_coef_21'] = get_angular_coef(train_df, 21)

# Test df

test_df['rolling_mean_5'] = test_df['Close'].rolling(5, closed='both').mean()
test_df['rolling_mean_10'] = test_df['Close'].rolling(10, closed='both').mean()
test_df['rolling_mean_21'] = test_df['Close'].rolling(21, closed='both').mean()
test_df.dropna(inplace=True)

test_df['angular_coef_5'] = get_angular_coef(test_df, 5)
test_df['angular_coef_10'] = get_angular_coef(test_df, 10)
test_df['angular_coef_21'] = get_angular_coef(test_df, 21)

In [43]:
train_df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,normalized_price,target,rolling_mean_5,rolling_mean_10,rolling_mean_21,angular_coef_5,angular_coef_10,angular_coef_21
2000-01-31,1.857708,1.857708,1.857708,1.857708,32266240000,0.0,0.0,1.000000,0.0,1.883323,1.905913,1.955449,0.075022,6.740732e-02,5.823012e-02
2000-02-01,1.893763,1.893763,1.893763,1.893763,23672320000,0.0,0.0,1.000000,2.0,1.885569,1.901345,1.952645,0.112169,7.899503e-02,7.384671e-02
2000-02-02,1.930182,1.930182,1.930182,1.930182,14272000000,0.0,0.0,1.000000,2.0,1.893884,1.899259,1.943126,0.129650,9.889274e-02,9.392654e-02
2000-02-03,1.984809,1.984809,1.984809,1.984809,25950720000,0.0,0.0,1.000000,2.0,1.907541,1.903397,1.941471,0.125886,1.076001e-01,1.077987e-01
2000-02-04,2.035067,2.035067,2.035067,2.035067,21199360000,0.0,0.0,1.000000,2.0,1.931820,1.912502,1.943027,0.091774,1.022366e-01,1.159763e-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-11-13,3.777290,3.807508,3.641307,3.661453,67000600,0.0,0.0,0.121212,2.0,3.853675,3.932960,3.959744,0.005036,-3.479697e-02,-1.694050e-02
2015-11-16,3.681598,3.878017,3.681598,3.878017,56587800,0.0,0.0,1.000000,2.0,3.843602,3.937081,3.953792,0.015109,-3.891761e-02,-1.098840e-02
2015-11-17,3.928381,3.958599,3.847799,3.908235,52072800,0.0,0.0,0.545456,2.0,3.831011,3.939370,3.948298,0.027700,-4.120684e-02,-5.494161e-03
2015-11-18,3.933418,4.044219,3.923345,3.938455,38137700,0.0,0.0,0.125001,2.0,3.848638,3.909152,3.945322,0.010073,-1.098854e-02,-2.518177e-03


<a id="modeltraining"></a>

<div class="alert alert-block alert-info">
<b><h2><center>2. Train time!</center></h2></b>


</div>

### 2.1 - Model and features

Now that we have prepared our dataset, it's time to build our Machine Learning model. We start by proposing three subsets of features: 
- all_features: uses all features in the dataset;  
- f_classif_features: uses the 9 features selected by SelectKBest with the ANOVA F-value;  
- mutual_info_features: uses the 9 features selected by SelectKBest with the mutual information score.

The best subset will be selected based on the model score. The tuned and untuned versions of the model will also be compared.
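
As a hedged illustration of how SelectKBest ranks features, the sketch below builds a synthetic dataset (not the stock data) in which only the first column depends on the class. All names and values are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Three balanced classes and four noise features
y_toy = np.repeat([0, 1, 2], 50)
X_toy = rng.normal(size=(150, 4))

# Only column 0 carries information about the class
X_toy[:, 0] += 3 * y_toy

# Keep the single feature with the highest ANOVA F-value
selector_toy = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)
kept = selector_toy.get_support()

print(kept)
```

The boolean mask marks the informative column as the one that is kept. Swapping `f_classif` for `mutual_info_classif` ranks the features by mutual information instead.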

In [44]:
all_features = ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'normalized_price', 
                    'rolling_mean_5', 'rolling_mean_10', 'rolling_mean_21', 'angular_coef_5', 'angular_coef_10', 'angular_coef_21']

target = 'target'

selector = SelectKBest(f_classif, k=9)
selected = selector.fit_transform(train_df[all_features], train_df[target])
f_classif_features = selector.get_feature_names_out(all_features)

selector = SelectKBest(mutual_info_classif, k=9)
selected = selector.fit_transform(train_df[all_features], train_df[target])
mutual_info_features = selector.get_feature_names_out(all_features)

The classification algorithm chosen for this walkthrough is LogisticRegression with the *class_weight* parameter set to *balanced*, since the classes are highly imbalanced, as the proportions below show.

In [45]:
t0 = len(train_df[train_df[target] == 0])/len(train_df)
t1 = len(train_df[train_df[target] == 1])/len(train_df)
t2 = len(train_df[train_df[target] == 2])/len(train_df)

print('Proportion of class 0:', t0)
print('Proportion of class 1:', t1)
print('Proportion of class 2:', t2)

Proportion of class 0: 0.02230576441102757
Proportion of class 1: 0.02230576441102757
Proportion of class 2: 0.9553884711779449
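
For intuition about what *balanced* does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a made-up label vector with a similar imbalance (2% / 2% / 96%); `y_toy` is hypothetical.

```python
import numpy as np

# Made-up labels: 2 buys, 2 sells and 96 holds
y_toy = np.array([0] * 2 + [1] * 2 + [2] * 96)

counts = np.bincount(y_toy)

# Balanced weights: rare classes get proportionally larger weights
weights = len(y_toy) / (len(counts) * counts)

print(weights)
```

Each buy or sell sample weighs about 48 times as much as a hold sample, so the loss does not get dominated by the majority class.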


In [46]:
max_iter = 500
lr = LogisticRegression(class_weight='balanced', max_iter=max_iter)
features = [all_features, f_classif_features, mutual_info_features]


### 2.2 - Cross-validation

Cross-validation was first used to compare multiple candidate models for this project, such as tree-based ones. I kept it because it also records the model's performance before hyperparameter tuning, so we can compare it with the tuned version later. The scoring method is one-vs-one AUC (`roc_auc_ovo`), a multiclass metric that averages the AUC over all pairwise combinations of classes and is not sensitive to class imbalance.
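
To illustrate why accuracy would be misleading here, this sketch scores a made-up "always hold" predictor on an imbalanced toy label vector. The data and names are hypothetical; `roc_auc_score` with `multi_class='ovo'` is the function behind the `roc_auc_ovo` scorer.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: one buy, one sell and 18 holds
y_true = np.array([0, 1] + [2] * 18)

# A predictor that assigns probability 1 to "hold" for every sample
proba_hold = np.tile([0.0, 0.0, 1.0], (len(y_true), 1))

acc = accuracy_score(y_true, proba_hold.argmax(axis=1))
auc = roc_auc_score(y_true, proba_hold, multi_class='ovo')

print(acc, auc)
```

The accuracy looks good (0.9) simply because holds dominate, while the one-vs-one AUC stays at 0.5, the value of an uninformative model.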

In [47]:
def get_cross_validation(model, X, y, scoring='accuracy', n_iter=10):
    results = np.array([])
    for c in range(n_iter):
        # note: StratifiedKFold() does not shuffle by default, so every
        # repetition uses the same folds
        kfold = StratifiedKFold()
        cross_results = cross_val_score(model, X, y, scoring=scoring, cv=kfold)
        results = np.concatenate([results, cross_results])
        
    return results

In [48]:
def benchmark(df, model, features, scoring='accuracy'):
    scores = np.zeros((1, len(features)))

    c = 0
    for feature in features:
        scaler = MinMaxScaler().fit(df[feature])
        X_scaled = scaler.transform(df[feature])


        values = get_cross_validation(model, X_scaled, df[target], scoring=scoring)
        scores[0, c] = values.mean()

        c += 1

    return scores

scores = benchmark(train_df, lr, features, scoring='roc_auc_ovo')
scores

array([[0.79864368, 0.78710462, 0.75518215]])

### 2.3 - Hyperparameter tuning and feature selection

We can try to improve the model's performance by tuning its hyperparameters with Optuna. Here, we need to find a good value for the C parameter, which is the inverse of the regularization strength (smaller values mean stronger regularization).
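
The effect of C can be seen on a small synthetic problem (not the stock data): with a small C the coefficients are shrunk toward zero, with a large C they are almost unconstrained. All names and values below are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two overlapping classes; only the first feature is informative
y_toy = np.repeat([0, 1], 100)
X_toy = rng.normal(size=(200, 2))
X_toy[:, 0] += 2 * y_toy

# Fit the same model with strong (C=0.01) and weak (C=100) regularization
norms = {}
for C in (0.01, 100):
    clf = LogisticRegression(C=C, max_iter=500).fit(X_toy, y_toy)
    norms[C] = float(np.linalg.norm(clf.coef_))

print(norms)
```

The coefficient norm is smaller for C=0.01 than for C=100, which is the trade-off the search over `lr_C` explores.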

In [49]:
def optimize(df, features, n_trial, max_iter, scoring='accuracy'):
    
    def objective_lr(trial, max_iter, X=None, y=None):  # Optuna function
        # LR optimization

        lr_C = trial.suggest_float('lr_C', 0.01, 100)

        classifier_obj = LogisticRegression(C=lr_C, class_weight='balanced', max_iter=max_iter)

        score = cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=StratifiedKFold(n_splits=5), scoring=scoring)
        acc = score.mean()
        return acc
    
    
    
    lr_results = []
    for feature in features:
        X = df[feature]
        y = df[target]
        
        scaler = MinMaxScaler().fit(X)
        X_scaled = scaler.transform(X)
        
        study_lr = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=42))
        study_lr.optimize(lambda trial: objective_lr(trial, max_iter=max_iter, X=X_scaled, y=y), n_trials=n_trial)
        lr_results.append(study_lr)
            
    
    results_dictionary = {'LR_study':[lr_results[0], lr_results[1], lr_results[2]]}
    return results_dictionary
    

opt = optimize(train_df, features, n_trial=5, max_iter=500, scoring='roc_auc_ovo')
opt_array = np.array([[opt['LR_study'][0].best_trial.values[0], 
                       opt['LR_study'][1].best_trial.values[0], 
                       opt['LR_study'][2].best_trial.values[0]]])
scores = np.concatenate([scores, opt_array])
scores

array([[0.79864368, 0.78710462, 0.75518215],
       [0.9177393 , 0.85092884, 0.8892418 ]])

Next, we take the highest score without tuning and the highest score after tuning, along with their respective indexes, compare the two values and keep the best one.

In [50]:
def select_feature(features, scores):
    no_tune_max = np.where(scores[0] == scores[0].max())[0]
    no_tune_feature = no_tune_max[len(no_tune_max)-1]
    
    tune_max = np.where(scores[1] == scores[1].max())[0]
    tune_feature = tune_max[len(tune_max)-1]
    
    if scores[1, tune_feature] > scores[0, no_tune_feature]:
        tune = True
        return scores[1, tune_feature], features[tune_feature], tune_feature, tune
    else:
        tune = False
        return scores[0, no_tune_feature], features[no_tune_feature], no_tune_feature, tune
    
best_score, feature_subset, idx, tune = select_feature(features, scores)
print('Highest score (AUC OvO):', best_score, end='\n\n')
print('Features used:', feature_subset, end='\n\n')
print('Tuned?', tune)

Highest score (AUC OvO): 0.9177393029070384

Features used: ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'normalized_price', 'rolling_mean_5', 'rolling_mean_10', 'rolling_mean_21', 'angular_coef_5', 'angular_coef_10', 'angular_coef_21']

Tuned? True


### 2.4 - Predictions

We can finally use our model to predict whether a point is a minimum, maximum or neither. Let's see the confusion matrix to evaluate the quality of those forecasts.

In [51]:
X = train_df[feature_subset]
y = train_df[target]

scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)

if tune == True:
    lr = LogisticRegression(C=opt['LR_study'][idx].best_trial.params['lr_C'], max_iter=max_iter, class_weight='balanced')
else:
    lr = LogisticRegression(max_iter=max_iter, class_weight='balanced')
    
lr.fit(X_scaled, y)

preds = lr.predict(X_scaled)
px.imshow(confusion_matrix(y, preds), 
labels=dict(x='Predicted', y='True value', color='Number of samples'),
                x=['0(buy)', '1(sell)', '2(hold)'],
                y=['0(BUY)', '1(SELL)', '2(HOLD)'], text_auto='.0f')

We may want to be more certain about the predictions for classes 0 and 1. To do that, we raise the probability required to assign labels 0 and 1 to 0.75.

In the end, we increased the number of "false 2" predictions but strongly decreased the number of "false 0" and "false 1".

In [52]:
def predict_probabilities(model, X):
    preds = np.zeros(len(X))
    probs = model.predict_proba(X)

    idx = 0
    for row in probs:
        if row[0] >= 0.75:
            preds[idx] = 0
        else:
            if row[1] >= 0.75:
                preds[idx] = 1
            else:
                preds[idx] = 2
        idx += 1

    return preds


model_predictions = predict_probabilities(lr, X_scaled)

px.imshow(confusion_matrix(y, model_predictions), 
labels=dict(x='Predicted', y='True value', color='Number of samples'),
                x=['0(buy)', '1(sell)', '2(hold)'],
                y=['0(BUY)', '1(SELL)', '2(HOLD)'], text_auto='.0f')
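
The thresholding rule can also be written in vectorized form. This is a sketch under the assumption that both classes use `>=` against the same threshold; `threshold_labels` and the probabilities are hypothetical, not part of the project files.

```python
import numpy as np

def threshold_labels(probs, threshold=0.75):
    """Assign 0 (buy) or 1 (sell) only when the class probability reaches
    the threshold; everything else becomes 2 (hold)."""
    probs = np.asarray(probs)
    # np.select uses the first condition that is true, so class 0 has priority
    return np.select(
        [probs[:, 0] >= threshold, probs[:, 1] >= threshold],
        [0, 1],
        default=2,
    )

toy_probs = [
    [0.80, 0.10, 0.10],  # confident buy
    [0.10, 0.80, 0.10],  # confident sell
    [0.50, 0.40, 0.10],  # buy is the most likely class but below the threshold
    [0.20, 0.20, 0.60],  # hold
]
toy_labels = threshold_labels(toy_probs)
print(toy_labels)
```

The third row shows the effect of the threshold: a plain `predict` would return class 0, while the stricter rule falls back to hold.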

For comparison, let's plot the predicted points on top of the previous chart. The light green points are the predicted buy opportunities, while the light salmon ones are the predicted sell spots. From the plot below, we see that our model missed a lot of opportunities at the beginning, but performed better as time goes by.

In [53]:
train_df['preds'] = model_predictions

for c in range(len(train_df)):
    target_value = train_df.iloc[c]['preds']
    
    if target_value == 0:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='lightgreen'), showlegend=False))
    if target_value == 1:
        fig.add_trace(go.Scatter(x=[train_df.iloc[c].name], y=[train_df.iloc[c]['Close']], mode='markers', marker=dict(color='lightsalmon'), showlegend=False))
    else:
        continue

fig.update_layout(title='Close price time series with model predictions')
fig

<a id="profitability"></a>

<div class="alert alert-block alert-info">
<b><h2><center>3. Profitability</center></h2></b>


</div>

We start with some initial budget and no stocks, then iterate day by day while tracking our model predictions: 

- if it's a 0, we buy as many stocks as our current money allows at that day's price, decreasing our cash and increasing our number of stocks;  
- if it's a 1, we sell all the stocks we have, increasing our current money according to the price on the sell day and closing our position;  
- if it's a 2, we do nothing.

The function also stores every trade whose result was **different from zero**. This is because if our model predicts two consecutive sell days, on the second day we will have no stocks (since we sold them before) and the money earned will be zero.

Finally, if the amount of money earned in a sale is higher than 20 thousand, we have to pay a 15% tax.
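
A single buy/sell round trip under these rules can be worked through with made-up prices. The helper `round_trip` is hypothetical and only mirrors the description above (whole shares only, 15% tax when the sale proceeds exceed 20 thousand); it is a sketch, not the project's simulation code.

```python
def round_trip(budget, buy_price, sell_price, tax_threshold=20000, tax_rate=0.15):
    """Buy as many whole shares as the budget allows, then sell them all."""
    n_stocks = budget // buy_price           # whole shares only
    cash = budget - n_stocks * buy_price     # money left after the purchase
    proceeds = n_stocks * sell_price         # money received from the sale
    if proceeds > tax_threshold:
        proceeds *= (1 - tax_rate)           # tax applies to large sales only
    return cash + proceeds

# 5000 at price 4.0 buys 1250 shares; selling at 5.0 yields 6250 (no tax)
small = round_trip(5000, 4.0, 5.0)

# 20000 at price 4.0 buys 5000 shares; selling at 5.0 yields 25000, taxed to 21250
large = round_trip(20000, 4.0, 5.0)

print(small, large)
```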

In [54]:
def testing(X, preds, initial_budget):
    
    values = []
    different_from_zero_sells = []
    current_money = initial_budget
    n_of_stocks = 0
    
    previous_buy_price = 0
    
    for c in range(len(preds)):
        
        if preds[c] == 0:  # buy
            
            previous_buy_price = X.iloc[c]['Close'] 
            n_of_stocks_buy = (current_money // X.iloc[c]['Close'])
            spent = n_of_stocks_buy*X.iloc[c]['Close']
            
            n_of_stocks += n_of_stocks_buy
            current_money = current_money - spent
        

        elif preds[c] == 1:  # sell
                        
            earned = n_of_stocks * X.iloc[c]['Close']
            
            if earned != 0:
                different_from_zero_sells.append((X.iloc[c]['Close'] - previous_buy_price)*n_of_stocks)
                
            if earned > 20000:
                # apply the 15% tax before crediting the sale to our cash
                earned = 0.85*earned
                
            current_money = current_money + earned
            n_of_stocks = 0

        elif preds[c] == 2:  # do nothing
                        
            pass 
        
        values.append((n_of_stocks * X.iloc[c]['Close']) + current_money)
        
    
    money_value = (n_of_stocks * X.iloc[len(X)-1]['Close']) + current_money
    return money_value, different_from_zero_sells, values

money, different_zero_trades, values = testing(train_df, model_predictions, initial_budget=5000)

The bar chart below shows all non-zero trades that we would have made on the train set. We would have taken a loss only once.

In [55]:
colors = []
for c in different_zero_trades:
    if c > 0:
        colors.append('green')
    else:
        colors.append('red')


bar_trades = go.Figure()
bar_trades.add_trace(go.Bar(y=different_zero_trades, marker_color=colors))
bar_trades.update_yaxes(title='Earnings')
bar_trades.update_layout(title='Earnings for every trade (train set)')
bar_trades

In [56]:
figure = go.Figure()
figure.add_trace(go.Scatter(x=train_df.index, y=values))
figure.update_yaxes(title='Money value')
figure.update_layout(title='Evolution of a 5000 investment')
figure

We can do the same thing for the test set.

In [57]:
Xtest_scaled = scaler.transform(test_df[feature_subset])

model_predictions_test = predict_probabilities(lr, Xtest_scaled)
money, different_zero_trades, values = testing(test_df, model_predictions_test, initial_budget=5000)

The amount of money earned was lower than on the train set but, surprisingly, we didn't take any loss by using this model to trade on the test set.

In [58]:
colors = []
for c in different_zero_trades:
    if c > 0:
        colors.append('green')
    else:
        colors.append('red')


bar_trades = go.Figure()
bar_trades.add_trace(go.Bar(y=different_zero_trades, marker_color=colors))
bar_trades.update_yaxes(title='Earnings')
bar_trades.update_layout(title='Earnings for every trade (test set)')
bar_trades

In [59]:
figure = go.Figure()
figure.add_trace(go.Scatter(x=test_df.index, y=values))
figure.update_yaxes(title='Money value')
figure.update_layout(title='Evolution of a 5000 investment')
figure

With a better understanding of how this project is structured, it's time to select a group of stocks and simulate how a Logistic Regression model (as well as Random Forest and SVC) would perform in the task of classifying buy, hold or sell spots.

*See next notebook*