# AC109 Project Modeling Results: Predicting the returns on Cryptocurrencies

by Ali Dastjerdi, Angelina Massa, Sachin Mathur & Nate Stein

### Supporting Libraries

To keep this notebook focused on presenting results, we moved some of the supporting code into modules we wrote, located in the main directory. The supporting modules are:
- `crypto_utils.py` contains the code we used to scrape and clean data from coinmarketcap.com. It also contains the code used to wrangle and preprocess that data (saved in CSV files) into our design matrix. We spun the creation of the design matrix off into its own `.py` file so that we could write unit tests confirming that the resulting features match hand-calculated figures. This became especially important as we engineered more involved features that built on previous features and assumptions.
- `create_models.py` contains the code we used to iterate over multiple classification and regression models and summarize the results for various performance metrics in a `DataFrame`.
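
The kind of unit test described for the design matrix can be sketched as follows. This is only an illustration: `rolling_return` is a hypothetical stand-in for one engineered feature, not the actual code in `crypto_utils.py`, and the prices are made up so the expected values can be worked out by hand.

```python
import unittest

import pandas as pd


def rolling_return(prices, n):
    """Hypothetical feature: percentage change in price over the last n periods."""
    return prices.pct_change(periods=n)


class TestRollingReturn(unittest.TestCase):
    def test_matches_hand_calculation(self):
        prices = pd.Series([100.0, 110.0, 121.0, 108.9])
        result = rolling_return(prices, n=2)
        # The first n observations have no earlier price to compare with.
        self.assertTrue(pd.isna(result.iloc[0]))
        self.assertTrue(pd.isna(result.iloc[1]))
        # Hand-calculated: 121 / 100 - 1 = 0.21 and 108.9 / 110 - 1 = -0.01.
        self.assertAlmostEqual(result.iloc[2], 0.21)
        self.assertAlmostEqual(result.iloc[3], -0.01)
```

Saved in a test module, a case like this can be run with `python -m unittest`. Features that build on earlier features can be checked the same way, one layer at a time.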

In [1]:
import create_models
import crypto_utils as cryp
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.model_selection as model_selection
import time

from crypto_utils import fmt_date, print_update
from sklearn.metrics import mean_absolute_error

In [2]:
# Custom output options.

np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)
sns.set_style('white')
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['savefig.bbox'] = 'tight'
plt.rcParams['savefig.pad_inches'] = 0.05
%matplotlib inline

In [3]:
RAND_STATE = 88  # used whenever random seed permitted for consistency

## Construct Design Matrix

We want the construction of the design matrix to be flexible enough that we can easily change which features we include, which cryptocurrency's price return we forecast, and so on.

In [4]:
def get_regression_data(x_cryptos, y_crypto, test_size, params):
    design = cryp.DesignMatrix(x_cryptos=x_cryptos, y_crypto=y_crypto, **params)
    X, Y = design.get_data(lag_indicator=True)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, Y, test_size=test_size, random_state=RAND_STATE)
    return X_train, X_test, y_train, y_test

In [5]:
crypto_scope = ['ltc', 'xrp', 'xlm', 'eth', 'btc']

# Store x cryptocurrencies and y crypto (the one we're forecasting)
# in list of tuples.
xy_crypto_pairs = []
for y_crypto in crypto_scope:
    x_cryptos = [c for c in crypto_scope if c != y_crypto]
    xy_crypto_pairs.append((x_cryptos, y_crypto))
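
As a quick illustration of the structure built above, the same pairs can be produced with a comprehension and printed. Each target is paired with the four remaining cryptocurrencies, which serve as predictors.

```python
crypto_scope = ['ltc', 'xrp', 'xlm', 'eth', 'btc']

# Each entry is (list of predictor cryptos, target crypto).
xy_crypto_pairs = [([c for c in crypto_scope if c != y], y)
                   for y in crypto_scope]

for x_cryptos, y_crypto in xy_crypto_pairs:
    print(y_crypto, '<-', x_cryptos)
# First line printed: ltc <- ['xrp', 'xlm', 'eth', 'btc']
```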

# Modeling: Regression

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
N_CROSSVAL = 3
TEST_SIZE = 0.2

## Baseline Model

In [8]:
def evaluate_baseline_model(x_cryptos, y_crypto, params):
    """Return MAE on test set."""
    X_train, X_test, y_train, y_test = get_regression_data(x_cryptos, 
                                                           y_crypto, TEST_SIZE,
                                                           params)
    lr = LinearRegression().fit(X_train, y_train)
    return mean_absolute_error(y_test, lr.predict(X_test))
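
For reference, the metric returned above is the mean absolute error, i.e. the average of $|y_i - \hat{y}_i|$ over the test observations. A minimal sketch of the same calculation with NumPy, using made-up returns rather than our data:

```python
import numpy as np


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up daily returns, for illustration only.
y_true = [0.02, -0.01, 0.03]
y_pred = [0.01, 0.01, 0.00]

# Absolute errors are 0.01, 0.02 and 0.03, so the MAE is approximately 0.02.
mae = mean_abs_error(y_true, y_pred)
print(round(mae, 4))
```

Because the targets are returns, an MAE of 0.02 means the predictions miss by about two percentage points on average.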

### Determine optimal rolling window for measuring changes in price and volume

Ultimately, we want to determine which `n_rolling_volume`, `n_rolling_price` and `n_std_window` to use going forward, as these choices will influence our more advanced features.
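
To make the role of the three windows concrete, here is a rough sketch of how such features could be computed with pandas. It assumes that `n_rolling_price` and `n_rolling_volume` are look-back periods for percentage changes and that `n_std_window` is the window for a rolling standard deviation of returns. The function `rolling_features` and its column names are hypothetical, and the actual `DesignMatrix` in `crypto_utils.py` may define these features differently.

```python
import pandas as pd


def rolling_features(prices, volumes, n_rolling_price, n_rolling_volume,
                     n_std_window):
    """Hypothetical window-based features (not the DesignMatrix code)."""
    one_period_returns = prices.pct_change()
    return pd.DataFrame({
        # Change in price over the last n_rolling_price periods.
        'price_change': prices.pct_change(periods=n_rolling_price),
        # Change in volume over the last n_rolling_volume periods.
        'volume_change': volumes.pct_change(periods=n_rolling_volume),
        # Volatility of one-period returns over the last n_std_window periods.
        'return_std': one_period_returns.rolling(window=n_std_window).std(),
    })


# Toy series, for illustration only.
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])
volumes = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
features = rolling_features(prices, volumes, n_rolling_price=1,
                            n_rolling_volume=1, n_std_window=3)
print(features)
```

Longer windows leave more leading rows undefined (`NaN`), so the window choice also affects how many observations remain for training.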

In [9]:
def find_optimal_rolling_periods():
    """Iterates over many different rolling period windows and evaluates 
    MAE on test set.
    
    Notes: Takes ~18min to run.
    """
    df_results = pd.DataFrame(columns=['y', 'mae', 'n_rolling_price', 
                                       'n_rolling_volume', 'n_std_window'])

    params = {'n_rolling_price':None, 'n_rolling_volume':None,
              'x_assets':[], 'n_std_window':None}

    n_rolling_prices = range(1, 5)
    n_rolling_volumes = range(1, 5)
    n_std_windows = range(5, 60, 5)
    
    combo_total = len(n_rolling_prices) * len(n_rolling_volumes) * len(n_std_windows)
    combo_count = 0
    
    t0 = time.time()
    for n_price in n_rolling_prices:
        for n_vol in n_rolling_volumes:
            for n_std in n_std_windows:
                combo_count += 1
                print_update('Trying param combination {}/{}...'.format(
                    combo_count, combo_total))
                params['n_rolling_price'] = n_price
                params['n_rolling_volume'] = n_vol
                params['n_std_window'] = n_std
                new_row = {'n_rolling_price': n_price,
                           'n_rolling_volume': n_vol,
                           'n_std_window': n_std}
                for x_cryps, y_cryp in xy_crypto_pairs:
                    new_row['y'] = y_cryp
                    new_row['mae'] = evaluate_baseline_model(x_cryps, y_cryp, 
                                                             params)
                    # DataFrame.append was removed in pandas 2.0; use concat.
                    df_results = pd.concat(
                        [df_results, pd.DataFrame([new_row])],
                        ignore_index=True)
    print_update('Finished all parameter combinations in {:.2f} seconds.'.format(
        time.time() - t0))
    
    # Compute an average for each window tuple across all cryptos.
    # Only 'mae' is averaged; the string column 'y' cannot be.
    avg_results = (df_results.astype({'mae': float})
                   .groupby(['n_rolling_price', 'n_rolling_volume',
                             'n_std_window'])[['mae']].mean())
    return df_results, avg_results

Iterating over the rolling window options in `find_optimal_rolling_periods()` and comparing test-set MAE gives the following optimal parameters:
- `n_rolling_price`: 1
- `n_rolling_volume`: 1
- `n_std_window`: 10
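
One way to read the best combination off the results is to average the MAE per window tuple and take the minimum. The sketch below uses a small made-up results table with the same columns as `df_results`; the numbers are invented for illustration and are not our actual results.

```python
import pandas as pd

# Toy results with the same columns as df_results (values are invented).
df_results = pd.DataFrame({
    'y': ['btc', 'eth', 'btc', 'eth'],
    'mae': [0.028, 0.050, 0.029, 0.051],
    'n_rolling_price': [1, 1, 1, 1],
    'n_rolling_volume': [1, 1, 1, 1],
    'n_std_window': [10, 10, 20, 20],
})

window_cols = ['n_rolling_price', 'n_rolling_volume', 'n_std_window']

# Average MAE across cryptos for each window tuple, then pick the smallest.
avg_mae = df_results.groupby(window_cols)['mae'].mean()
best_combo = avg_mae.idxmin()
best_params = {col: int(val) for col, val in zip(window_cols, best_combo)}
print(best_params)
```

Averaging across targets favors a single setting that works reasonably well for every cryptocurrency rather than one tuned to any individual target.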

In [10]:
OPTIMAL_PARAMS = {'n_rolling_price':1, 'n_rolling_volume':1,
                  'x_assets':[], 'n_std_window':10}

## Try Additional Regression Models

We now experiment with different models, using the optimal rolling windows found above with our baseline model.

We run all models with each cryptocurrency in turn as our target ($y$) value.
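
The loop over models lives in `create_models.py` and is not shown in this notebook. As a rough sketch of the general idea only, the hypothetical function below tunes one hyperparameter per model with cross-validation and reports the test-set MAE in a table with the same columns as our results. The function name, the two models, the grids and the synthetic data are all assumptions made for illustration, not the module's actual contents.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV


def compare_regressors(X_train, y_train, X_test, y_test, cv=3):
    """Hypothetical helper: tune each model, then score it on the test set."""
    candidates = {
        'Ridge': (Ridge(), 'alpha', [0.1, 1.0, 10.0]),
        'Lasso': (Lasso(max_iter=10000), 'alpha', [0.001, 0.01, 0.1]),
    }
    rows = []
    for name, (model, hyperparam, grid) in candidates.items():
        # Cross-validated search over a single hyperparameter.
        search = GridSearchCV(model, {hyperparam: grid}, cv=cv,
                              scoring='neg_mean_absolute_error')
        search.fit(X_train, y_train)
        # The best estimator is refit on all training data by default.
        score = mean_absolute_error(y_test, search.predict(X_test))
        rows.append({'model': name, 'score': score,
                     'hyperparam': hyperparam,
                     'value': search.best_params_[hyperparam]})
    return pd.DataFrame(rows).set_index('model').sort_values('score')


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)
results = compare_regressors(X[:160], y[:160], X[160:], y[160:])
print(results)
```

Sorting by `score` in ascending order puts the model with the lowest test-set MAE first.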

In [11]:
regression_results = {}  # store DataFrame for each target (crypto).

for x_cryptos, y_crypto in xy_crypto_pairs:
    X_train, X_test, y_train, y_test = get_regression_data(
        x_cryptos, y_crypto, TEST_SIZE, OPTIMAL_PARAMS)
    regression_results[y_crypto] = create_models.regression_models(
        X_train, y_train, X_test, y_test, scoring=mean_absolute_error)
    regression_results[y_crypto].sort_values(
        'score', ascending=True, inplace=True)

Finished evaluating regression models.           

In [12]:
for y_crypto, df_results in regression_results.items():
    print('Test Set MAE for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set MAE for btc


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0282 | alpha | 0.0049 |
| ElasticNet | 0.0282 | l1_ratio | 0.1 |
| XGBRegressor | 0.0282 | n_estimators | 94.0 |
| Ridge | 0.0285 | alpha | 10.0 |
| RandomForest | 0.0308 | n_estimators | 20.0 |



Test Set MAE for xrp


| model | score | hyperparam | value |
|---|---|---|---|
| Ridge | 0.0431 | alpha | 10.0 |
| Lasso | 0.0434 | alpha | 0.0415 |
| ElasticNet | 0.0434 | l1_ratio | 0.1 |
| XGBRegressor | 0.0436 | n_estimators | 53.0 |
| RandomForest | 0.0501 | n_estimators | 25.0 |



Test Set MAE for xlm


| model | score | hyperparam | value |
|---|---|---|---|
| Ridge | 0.0621 | alpha | 10.0 |
| Lasso | 0.0627 | alpha | 0.0101 |
| ElasticNet | 0.0627 | l1_ratio | 1.0 |
| XGBRegressor | 0.063 | n_estimators | 21.0 |
| RandomForest | 0.0657 | n_estimators | 15.0 |



Test Set MAE for ltc


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0368 | alpha | 0.0058 |
| ElasticNet | 0.0368 | l1_ratio | 0.1 |
| Ridge | 0.0369 | alpha | 10.0 |
| XGBRegressor | 0.0375 | n_estimators | 92.0 |
| RandomForest | 0.0413 | n_estimators | 25.0 |



Test Set MAE for eth


| model | score | hyperparam | value |
|---|---|---|---|
| Ridge | 0.0498 | alpha | 10.0 |
| XGBRegressor | 0.05 | n_estimators | 80.0 |
| Lasso | 0.05 | alpha | 0.0066 |
| ElasticNet | 0.05 | l1_ratio | 0.1 |
| RandomForest | 0.0545 | n_estimators | 25.0 |





# Modeling: Classification

In [13]:
CLF_THRESH = 0.01  # threshold to classify as Buy/Sell

In [14]:
def get_classification_data(x_cryptos, y_crypto, thresh, test_size, params):
    """Returns X_train, X_test, y_train, y_test data to use in the 
    classification problem.
    
    Args:
        thresh (float): Threshold to use in determining whether an observation 
        is classified as a Buy or Sell (vs. Do Nothing).
    """
    design = cryp.DesignMatrix(x_cryptos, y_crypto, **params)
    X, Y = design.get_data(lag_indicator=True, y_category=True,
                           y_category_thresh=thresh)
    return model_selection.train_test_split(X, Y, test_size=test_size, 
                                            random_state=RAND_STATE)
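
The thresholding described in the docstring can be illustrated with a small sketch. It assumes a three-class encoding of 1 for Buy, -1 for Sell and 0 for Do Nothing, applied symmetrically around zero. The function `label_returns` and this encoding are assumptions made for illustration; the actual logic lives in `DesignMatrix.get_data` and may differ.

```python
import numpy as np


def label_returns(returns, thresh):
    """Hypothetical labeling: 1 = Buy, -1 = Sell, 0 = Do Nothing."""
    returns = np.asarray(returns, dtype=float)
    labels = np.zeros(returns.shape, dtype=int)
    labels[returns > thresh] = 1     # return above +thresh
    labels[returns < -thresh] = -1   # return below -thresh
    return labels


# Made-up returns, for illustration only.
labels = label_returns([0.025, 0.004, -0.013, -0.009], thresh=0.01)
print(labels.tolist())  # [1, 0, -1, 0]
```

A larger threshold moves more observations into the Do Nothing class, which changes the class balance the classifiers see.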

In [15]:
clf_results = {}

for i, (x_cryptos, y_crypto) in enumerate(xy_crypto_pairs):
    print_update('Evaluating model for {0} ({1}/{2})'.format(
        y_crypto, i+1, len(xy_crypto_pairs)))
    X_train, X_test, y_train, y_test = get_classification_data(
        x_cryptos, y_crypto, CLF_THRESH, TEST_SIZE, OPTIMAL_PARAMS)
    clf_results[y_crypto] = create_models.traditional_models(
        X_train, y_train, X_test, y_test, pos_label=[1])
    clf_results[y_crypto].sort_values('Accuracy', ascending=True, 
                                      inplace=True)

Evaluating model for btc (5/5)                   

In [16]:
for y_crypto, df_results in clf_results.items():
    print('Test Set Directional Accuracy for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set Directional Accuracy for btc


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LogReg | 0.5124 | 0.2917 | 0.2917 |
| KNN | 0.5349 | 0.3333 | 0.2933 |
| ADABoost | 0.522 | 0.3542 | 0.3846 |
| SVM | 0.4968 | 0.3542 | 0.0 |
| LDA | 0.479 | 0.3646 | 0.3333 |
| QDA | 0.4165 | 0.375 | 0.3378 |
| RandomForest | 0.4454 | 0.4323 | 0.43 |



Test Set Directional Accuracy for xrp


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| QDA | 0.4663 | 0.276 | 0.3243 |
| LogReg | 0.5076 | 0.2812 | 0.0 |
| SVM | 0.5302 | 0.2812 | 0.0 |
| LDA | 0.5528 | 0.3594 | 0.3642 |
| KNN | 0.545 | 0.3854 | 0.3953 |
| ADABoost | 0.4784 | 0.3958 | 0.3958 |
| RandomForest | 0.4418 | 0.4219 | 0.4366 |



Test Set Directional Accuracy for xlm


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| SVM | 0.5391 | 0.1823 | 0.0 |
| QDA | 0.4981 | 0.2604 | 0.4286 |
| ADABoost | 0.4854 | 0.4271 | 0.4339 |
| LogReg | 0.5468 | 0.4323 | 0.4346 |
| KNN | 0.5162 | 0.4479 | 0.4479 |
| LDA | 0.5561 | 0.4531 | 0.4531 |
| RandomForest | 0.4943 | 0.4531 | 0.4503 |



Test Set Directional Accuracy for ltc


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| KNN | 0.5112 | 0.2917 | 0.2727 |
| SVM | 0.4708 | 0.3021 | 0.0 |
| LDA | 0.5099 | 0.3385 | 0.4286 |
| RandomForest | 0.5169 | 0.3385 | 0.3085 |
| ADABoost | 0.5464 | 0.3438 | 0.4359 |
| QDA | 0.4636 | 0.3646 | 0.3939 |
| LogReg | 0.5 | 0.3698 | 0.3698 |



Test Set Directional Accuracy for eth


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LogReg | 0.4749 | 0.1875 | |
| SVM | 0.5228 | 0.1875 | 0.0 |
| QDA | 0.4667 | 0.25 | 0.4167 |
| RandomForest | 0.5326 | 0.4115 | 0.4317 |
| KNN | 0.5597 | 0.4375 | 0.4375 |
| ADABoost | 0.4647 | 0.4375 | 0.4421 |
| LDA | 0.5212 | 0.4427 | 0.4427 |





In [17]:
# regression_results['btc'].to_latex('reg_results_btc')