This notebook extends the previous _multilinear\_bitcoin\_regression_ notebook. The goal of the code below is to build a nonlinear bitcoin regression model and evaluate whether it performs better than the previous (linear) one.
<br> __Note:__ This notebook does not repeat the details discussed in the previous notebook (BTC_multilinear_regression).

The starting point is the best model created in the previous notebook (the data is limited to the period between 12.2014 and 04.2020, and BTC volume, gold price and oil price are used as predictors). The result to beat is $RMSE = 2170$.
 
The analysis proceeds in the following steps:
1. Incorporating nonlinear associations into the linear model by including transformed versions of each predictor $X$, in the following form:
 $$ Y=\beta_0+\beta_1 X+\beta_2\sqrt{X} + \beta_3 X^2 + \beta_4 X^3 $$
In other words, it is evaluated whether the bitcoin price can be explained by nonlinear relationships with the selected predictors.
2. Testing the model performance using:
 - the ordinary least squares (OLS) method (the model is still linear in its coefficients, so nonlinear associations can be fitted with OLS)
 - the lasso method (with different alpha parameters)
<br> The model performance is evaluated by
  - splitting the dataset into ten different train and test subsets (10-fold cross-validation with shuffling). For each test set the root mean squared error (RMSE) is calculated, and then the mean of the ten RMSE values is taken. For the lasso method, several models with different values of the $ \alpha $ parameter are evaluated
3. Comparing the calculated RMSE values.
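
The two error measures used in step 2 can be written out explicitly. The notation is introduced here for illustration only and is not taken from the previous notebook: $T_k$ is the $k$-th test subset, $n_k$ its size, and $\hat{y}_i$ the model prediction for observation $i$.

$$ RMSE_k = \sqrt{\frac{1}{n_k}\sum_{i \in T_k}\left(y_i-\hat{y}_i\right)^2}, \qquad \overline{RMSE} = \frac{1}{10}\sum_{k=1}^{10} RMSE_k $$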

The lasso method is selected as the second method because it allows some coefficients to be exactly zero. <br>
__Note:__ the author is familiar with tools like _LassoCV_ and _cross_val_score_. Nevertheless, he wanted to show the necessary steps explicitly using the _KFold_ class.
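
For comparison, the shortcut mentioned in the note could look roughly like the sketch below. It is not the code used in this notebook: the data is synthetic (an assumption made so the example runs on its own), and the names `X_demo`, `y_demo` and `kf_demo` are hypothetical. It shows that `cross_val_score` with the `neg_root_mean_squared_error` scorer should give the same per-fold RMSE as an explicit `KFold` loop.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 3 predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=1)

# Shortcut: scikit-learn maximizes scores, so RMSE is returned negated
neg_rmse = cross_val_score(Lasso(alpha=0.01), X_demo, y_demo,
                           cv=kf_demo, scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

# Explicit loop over the same splitter, mirroring the approach of this notebook
manual_rmse = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = Lasso(alpha=0.01).fit(X_demo[train_idx], y_demo[train_idx])
    residuals = y_demo[test_idx] - model.predict(X_demo[test_idx])
    manual_rmse.append(np.sqrt(np.mean(residuals ** 2)))
manual_rmse = np.array(manual_rmse)

print(rmse_per_fold.mean(), manual_rmse.mean())
```

Because the splitter has a fixed `random_state`, both approaches see identical folds, which is why the two means are expected to agree.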

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

from sklearn.exceptions import ConvergenceWarning
import warnings

from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Loading the dataset and preparing it for analysis

In [15]:
data=pd.read_csv('BTC_regression_data.csv',index_col='Date')
data.index=pd.to_datetime(data.index,format='%Y-%m-%d')
date_mask = (data.index > pd.to_datetime('2017-05-01')) & (data.index < pd.to_datetime('2020-02-01'))
data = data[date_mask]
y=data['BTC price [USD]']
X=data[['Volume BTC','Gold price[USD]','Oil WTI price[USD]']]
X

Date,Volume BTC,Gold price[USD],Oil WTI price[USD]
2017-05-02,11924.59,1324.0,47.65
2017-05-03,16309.77,1315.5,47.79
2017-05-04,26688.81,1295.3,45.55
2017-05-05,16885.42,1293.7,46.23
2017-05-08,15881.13,1294.0,46.46
...,...,...,...
2020-01-27,9424.43,1589.5,53.09
2020-01-28,16365.39,1581.7,53.33
2020-01-29,12035.69,1582.0,53.29
2020-01-30,11818.13,1595.1,52.19


Adding nonlinear terms to the dataset

In [16]:
X_nonlinear=X.copy()
scale=['square root','squared','cubed']
scale_num=[0.5,2,3]
for i in X_nonlinear.columns:
    for j,k in zip(scale,scale_num):
        name = j + " " + str(i)  
        X_nonlinear[name]=X_nonlinear[i]**k

In [17]:
X_nonlinear.head()

Date,Volume BTC,Gold price[USD],Oil WTI price[USD],square root Volume BTC,squared Volume BTC,cubed Volume BTC,square root Gold price[USD],squared Gold price[USD],cubed Gold price[USD],square root Oil WTI price[USD],squared Oil WTI price[USD],cubed Oil WTI price[USD]
2017-05-02,11924.59,1324.0,47.65,109.199771,142195800.0,1695627000000.0,36.386811,1752976.0,2320940000.0,6.902898,2270.5225,108190.397125
2017-05-03,16309.77,1315.5,47.79,127.70971,266008600.0,4338539000000.0,36.269822,1730540.25,2276526000.0,6.913031,2283.8841,109146.821139
2017-05-04,26688.81,1295.3,45.55,163.367102,712292600.0,19010240000000.0,35.990276,1677802.09,2173257000.0,6.749074,2074.8025,94507.253875
2017-05-05,16885.42,1293.7,46.23,129.943911,285117400.0,4814327000000.0,35.968041,1673659.69,2165214000.0,6.799265,2137.2129,98803.352367
2017-05-08,15881.13,1294.0,46.46,126.020355,252210300.0,4005384000000.0,35.972211,1674436.0,2166720000.0,6.816157,2158.5316,100285.378136
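
The transformation loop can also be wrapped in a small helper and checked on values that are easy to verify by hand. This is only a sketch: the function name `add_power_terms` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def add_power_terms(frame, powers):
    """Return a copy of `frame` with one extra column per (column, power) pair."""
    out = frame.copy()
    for col in frame.columns:
        for label, exponent in powers.items():
            out[f"{label} {col}"] = frame[col] ** exponent
    return out

# Toy input chosen so that the square roots are whole numbers
toy = pd.DataFrame({"x": [4.0, 9.0]})
expanded = add_power_terms(toy, {"square root": 0.5, "squared": 2, "cubed": 3})
print(expanded)
```

Iterating over `frame.columns` rather than `out.columns` makes it explicit that only the original predictors are transformed, not the newly added columns.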


First, the performance of the nonlinear model is evaluated with the __ordinary least squares__ method.

In [18]:
kf=KFold(n_splits=10,shuffle=True,random_state=1)
RMSE_lr=[]


#the OLS model (linear in its coefficients) is fitted and evaluated 10 times in this loop, once per fold
for train_idx, test_idx in kf.split(X_nonlinear):
    x_train=X_nonlinear.iloc[train_idx,:]
    x_test=X_nonlinear.iloc[test_idx,:]
    y_train=y.iloc[train_idx]
    y_test=y.iloc[test_idx]
    linear_regression_model=LinearRegression()
    linear_regression_model.fit(x_train,y_train)
    y_pred_linear_regression=linear_regression_model.predict(x_test)
    #np.sqrt is used because the squared=False argument is deprecated in recent scikit-learn releases
    RMSE_lr.append(np.sqrt(mean_squared_error(y_test,y_pred_linear_regression)))


In [19]:
print('root mean squared error for linear regression: ')
for i,j in enumerate(RMSE_lr,start=1):
    print('{} cross validation, RSE: {}'.format(i,j))
print('\nthe mean RMSE equals {}'.format(np.array(RMSE_lr).mean()))

root mean squared error for linear regression: 
1 cross validation, RSE: 2214.9228300560967
2 cross validation, RSE: 1751.7776630848875
3 cross validation, RSE: 2029.0719960541057
4 cross validation, RSE: 1866.0641841688414
5 cross validation, RSE: 2147.2777587918176
6 cross validation, RSE: 2177.1560218016984
7 cross validation, RSE: 2485.7235870228924
8 cross validation, RSE: 2307.961499684483
9 cross validation, RSE: 2375.583830766533
10 cross validation, RSE: 1946.3778699637019

the mean RMSE equals 2130.1917241395054
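
To put the mean RMSE in context, the spread of the ten fold values can be checked as well. The sketch below is not part of the original analysis. It re-enters the fold values printed above (rounded to two decimals) and compares their sample standard deviation with the gap to the linear baseline of 2170.

```python
import numpy as np

# Fold RMSE values copied from the output above, rounded to two decimals
rmse_folds = np.array([2214.92, 1751.78, 2029.07, 1866.06, 2147.28,
                       2177.16, 2485.72, 2307.96, 2375.58, 1946.38])

mean_rmse = rmse_folds.mean()
std_rmse = rmse_folds.std(ddof=1)   # sample standard deviation across folds
gain = 2170 - mean_rmse             # improvement over the linear baseline

print(f"mean RMSE: {mean_rmse:.1f}")
print(f"std across folds: {std_rmse:.1f}")
print(f"gain vs. linear baseline: {gain:.1f}")
```

When the gain is small relative to the fold-to-fold spread, as it is here, the difference between the two models is hard to distinguish from sampling noise.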


Next __the lasso method__ is used.
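
The property that motivates the choice of lasso, namely that it can set coefficients exactly to zero, can be illustrated on synthetic data. The sketch below is an illustration under stated assumptions and does not use the bitcoin dataset: five standard-normal predictors are generated, only the first two influence the target, and the value `alpha=0.5` is chosen for the demonstration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): only the first two of five predictors matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

ols_coef = LinearRegression().fit(X_demo, y_demo).coef_
lasso_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_

# Count the coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols_coef == 0))
n_zero_lasso = int(np.sum(lasso_coef == 0))

print("OLS coefficients:  ", np.round(ols_coef, 3))
print("lasso coefficients:", np.round(lasso_coef, 3))
print("exact zeros (OLS, lasso):", n_zero_ols, n_zero_lasso)
```

OLS assigns a small nonzero weight to every predictor, whereas lasso removes the irrelevant ones and shrinks the remaining coefficients towards zero.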

In [20]:
#first, check which order of magnitude is appropriate for the alpha value in the lasso method
alphas=np.array([0.01,0.1,1,10,100,1000])
RMSE_lasso= []

warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")

for train_idx, test_idx in kf.split(X_nonlinear):  
    x_train=X_nonlinear.iloc[train_idx,:]
    x_test=X_nonlinear.iloc[test_idx,:]
    y_train=y.iloc[train_idx]
    y_test=y.iloc[test_idx]
    
    #for each train-test split, lasso is evaluated with six different alpha values
    RMSE=[]
    for k in alphas:
        #max_iter is passed as an integer; recent scikit-learn versions reject a float such as 1e6
        model_lasso=Lasso(alpha=k,max_iter=1_000_000,tol=0.0001)
        model_lasso.fit(x_train,y_train)
        
        y_pred_lasso=model_lasso.predict(x_test)
        #np.sqrt is used because the squared=False argument is deprecated in recent scikit-learn releases
        RMSE.append(np.sqrt(mean_squared_error(y_test,y_pred_lasso)))
        

    RMSE_lasso.append(np.array(RMSE))

In [21]:
print('root mean squared error for the lasso method for different alpha: ')
for i,j in zip(alphas,np.array(RMSE_lasso).mean(axis=0)):
    print('for alpha equal to {} mean RMSE for lasso equals {}'.format(i,j)) 


root mean squared error for the lasso method for different alpha: 
for alpha equal to 0.01 mean RMSE for lasso equals 2128.585481130199
for alpha equal to 0.1 mean RMSE for lasso equals 2131.1465445281274
for alpha equal to 1.0 mean RMSE for lasso equals 2132.354750680741
for alpha equal to 10.0 mean RMSE for lasso equals 2142.6430852333115
for alpha equal to 100.0 mean RMSE for lasso equals 2145.270637580833
for alpha equal to 1000.0 mean RMSE for lasso equals 2144.2181007844792


It seems that $ \alpha = 0.01 $ gives the lowest RMSE. Next, values around it are examined in more detail.


In [22]:
alphas2=np.array([0.001,0.0025,0.005,0.0075,0.01,0.0125,0.015,0.0175,0.02])
RMSE_lasso2=[]
for train_idx, test_idx in kf.split(X_nonlinear):  
    x_train=X_nonlinear.iloc[train_idx,:]
    x_test=X_nonlinear.iloc[test_idx,:]
    y_train=y.iloc[train_idx]
    y_test=y.iloc[test_idx]
    
    #for each train-test split, lasso is evaluated with nine different alpha values
    RMSE=[]
    for k in alphas2:
        #max_iter is passed as an integer; recent scikit-learn versions reject a float such as 1e6
        model_lasso=Lasso(alpha=k,max_iter=1_000_000,tol=0.0001)
        model_lasso.fit(x_train,y_train)
        y_pred_lasso=model_lasso.predict(x_test)
        #np.sqrt is used because the squared=False argument is deprecated in recent scikit-learn releases
        RMSE.append(np.sqrt(mean_squared_error(y_test,y_pred_lasso)))

    RMSE_lasso2.append(np.array(RMSE))

In [23]:
print('root mean squared error for the lasso method for different alpha: ')
for i,j in zip(alphas2,np.array(RMSE_lasso2).mean(axis=0)):
    print('for alpha equal to {} mean RMSE for lasso equals {}'.format(i,j))

root mean squared error for the lasso method for different alpha: 
for alpha equal to 0.001 mean RMSE for lasso equals 2128.3030052475333
for alpha equal to 0.0025 mean RMSE for lasso equals 2128.345508043212
for alpha equal to 0.005 mean RMSE for lasso equals 2128.420687404928
for alpha equal to 0.0075 mean RMSE for lasso equals 2128.500832592823
for alpha equal to 0.01 mean RMSE for lasso equals 2128.585481130199
for alpha equal to 0.0125 mean RMSE for lasso equals 2128.6741129458906
for alpha equal to 0.015 mean RMSE for lasso equals 2128.765842445536
for alpha equal to 0.0175 mean RMSE for lasso equals 2128.860087461192
for alpha equal to 0.02 mean RMSE for lasso equals 2128.9565596515777


In [24]:
#DataFrame.append was removed in pandas 2.0, so the one-row summary is built directly
summary=pd.DataFrame({'linear model':[2170],'nonlinear model - OLS':[2130],'nonlinear model - lasso':[2128]},index=['RMSE'])
summary

,linear model,nonlinear model - OLS,nonlinear model - lasso
RMSE,2170,2130,2128


### The conclusion
- the improvement resulting from adding nonlinear terms is negligible. The RMSE dropped from 2170 to 2128 (for the lasso method with $ \alpha = 0.001 $), which is about 2\%. No further examination is necessary
- the general conclusion is as follows: it is impossible to build a reliable and useful model for bitcoin price prediction (using main market assets) by means of linear and nonlinear regression
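
The "about 2\%" figure can be reproduced from the rounded RMSE values in the summary table. The helper name `relative_improvement` is hypothetical and introduced here only for this check.

```python
def relative_improvement(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

rmse_linear = 2170   # best linear model from the previous notebook
rmse_ols = 2130      # nonlinear model, OLS
rmse_lasso = 2128    # nonlinear model, lasso

ols_gain = relative_improvement(rmse_linear, rmse_ols)
lasso_gain = relative_improvement(rmse_linear, rmse_lasso)

print(f"OLS:   {ols_gain:.2%}")
print(f"lasso: {lasso_gain:.2%}")
```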