#### Weekly machine learning strategy

It would be nice to predict whether the price of BTC/USD will go up or down over the next week. We will set this up as a machine learning problem, specifically a three-class classification problem.

Here are the specifications: 
* If next week's return in BTC/USD is greater than the acceptable return (+2%) -> label as +1
* If next week's return in BTC/USD is less than the negative of that threshold (-2%) -> label as -1
* Otherwise (the return is within the ±2% band) -> label as 0
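The notebook imports `label_by_threshold` from a local `labels` module that is not shown here. The following is only a guess at what such a helper might look like, based on the three rules above and on the `upper_threshold` / `lower_threshold` keyword names used when it is called; the real implementation may differ.

```python
def label_by_threshold(value, upper_threshold, lower_threshold):
    """Hypothetical stand-in for labels.label_by_threshold (an assumption, not the original code).

    Returns +1 above the upper threshold, -1 below the lower threshold, and 0 otherwise.
    """
    if value > upper_threshold:
        return 1
    if value < lower_threshold:
        return -1
    # Values exactly on a threshold, and anything inside the band, fall through to 0.
    return 0


# Example with the 2% band used in this notebook
example_labels = [
    label_by_threshold(r, upper_threshold=0.02, lower_threshold=-0.02)
    for r in (0.05, -0.03, 0.01)
]
```

Note that in this sketch a missing return (`NaN`) would also be labelled 0, because both comparisons are false, so missing values should be dropped before labelling.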

The features we will try are completely endogenous to the BTC/USD price data: they are all derived from it. Here is the initial feature set:
* Past weekly returns (the code below uses the 14 most recent weekly returns, starting with the current week's)

After we try these out, we can move on to more sophisticated features.

In [5]:
import yfinance as yf 
import pandas as pd 
import numpy as np 

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from labels import label_by_threshold


In [2]:
btc_data = yf.download("BTC-USD", progress=False)

In [3]:
weekly_btc_data = btc_data.resample('W').agg({'Open': 'first', 
                                              'High': 'max', 
                                              'Low': 'min', 
                                              'Close': 'last', 
                                              'Adj Close': 'last', 
                                              'Volume': 'sum'})

weekly_returns = weekly_btc_data['Adj Close'].pct_change()

In [4]:
return_threshold = 0.02

# Target: next week's return, aligned to the current week's row
target_returns = weekly_returns.shift(-1).dropna()

target = target_returns.apply(
    label_by_threshold,
    upper_threshold=return_threshold,
    lower_threshold=-return_threshold,
)

In [9]:
target.value_counts(normalize=True).round(2)

Adj Close
 1    0.45
-1    0.30
 0    0.25
Name: proportion, dtype: float64

The most common label is +1 (45% of weeks), which makes sense given the long-term upward trajectory of BTC/USD over the period. Still, 30% of weeks fall below the -2% threshold and 25% land inside the band, so no class is negligible. Let's create the features and then fit a random forest model.

#### Creating features

In [45]:
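features = pd.DataFrame()

# return_1_lag is the current week's return (shift of 0); return_14_lag is 13 weeks back.
# The target is next week's return, so none of these columns look ahead.
for i in range(1, 15):
    features[f'return_{i}_lag'] = weekly_returns.shift(i - 1)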


In [46]:
all_data = pd.concat([features, target], axis = 1).dropna()

In [47]:
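# shuffle=False keeps the rows in time order, so the test set is the most recent 20% of weeks
features_train, features_test, target_train, target_test = train_test_split(
    all_data.iloc[:, :-1],  # lagged-return feature columns
    all_data.iloc[:, -1],   # label column (last column after the concat)
    test_size=0.2,
    shuffle=False,
)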

In [57]:
forest_model = RandomForestClassifier(
    n_estimators=300, 
    n_jobs=-1, 
    random_state=100, 
    max_depth=5, 
    class_weight='balanced_subsample'
) 


In [58]:

forest_model.fit(features_train, target_train)

In [59]:
forest_model.score(features_test, target_test)

0.34831460674157305
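To put an accuracy figure like this in context, it helps to compare it with a trivial baseline that always predicts the most common training label. The sketch below is a minimal pure-Python version of that idea; the function name and the toy labels are made up for illustration and are not taken from the notebook's data.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Hypothetical toy labels, only to show the call
toy_train = [1, 1, 1, -1, 0, 1, -1, 0, 1, -1]
toy_test = [1, -1, 0, 1, 1]

baseline_label, baseline_accuracy = majority_class_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

In the notebook this could be called as `majority_class_accuracy(target_train, target_test)`. scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same kind of baseline with the usual `fit` / `score` interface.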

This model has poor accuracy on the test set: about 35%, which is close to the roughly 33% that uniform random guessing would achieve on a three-class problem. Using balanced class weights did not fix this. There are a few ways we could improve it:
* Build more predictive features, for example through research and by trying technical indicators
    * We can also get more creative about which features to try
* Try other types of models. A neural network is one option, although the dataset is small
* Bring in other data sources to support additional features
* Model something else, such as daily or minutely data
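As one illustration of the first idea, here is a rough sketch of a few indicator-style features computed from a weekly close series. The function name, the choice of indicators, and the window lengths (4, 8, and 10 weeks) are all arbitrary assumptions, and the demo runs on synthetic prices rather than on the BTC/USD data.

```python
import numpy as np
import pandas as pd


def make_indicator_features(close: pd.Series) -> pd.DataFrame:
    """Hypothetical indicator features; each row uses only current and past prices."""
    returns = close.pct_change()
    feats = pd.DataFrame(index=close.index)
    feats["momentum_4w"] = close.pct_change(4)                          # 4-week return
    feats["volatility_8w"] = returns.rolling(8).std()                   # 8-week std of weekly returns
    feats["close_to_sma_10w"] = close / close.rolling(10).mean() - 1.0  # distance from 10-week average
    return feats


# Synthetic weekly prices, only to show that the function runs
rng = np.random.default_rng(0)
weeks = pd.date_range("2020-01-05", periods=60, freq="W")
demo_close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.05, size=60))), index=weeks)

demo_features = make_indicator_features(demo_close)
print(demo_features.dropna().head())
```

In the notebook, the input would be `weekly_btc_data['Adj Close']`, and the resulting columns could be concatenated with the lagged returns before the train/test split. Whether any of these features actually improves accuracy would have to be tested.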

