# AC109 Project Modeling Results: Predicting the returns on Cryptocurrencies

by Ali Dastjerdi, Angelina Massa, Sachin Mathur & Nate Stein

### Supporting Libraries

We moved some of the supporting code into separate modules, located in the main directory, so that this notebook can focus on presenting results. The supporting modules are:
- `crypto_utils.py` contains the code we used to scrape and clean data from coinmarketcap.com, and to wrangle/preprocess that data (saved in CSV files) into our design matrix. We moved the creation of the design matrix into its own `.py` file so that we could write unit tests verifying that the resulting features matched hand-calculated figures. This became especially important as we engineered more involved features that built on previous features and assumptions.
- `create_models.py` contains the code we used to iterate over multiple classification and regression models and summarize the results for various performance metrics in a `DataFrame`.
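
The unit-testing approach described above can be sketched as follows. This is a minimal illustration, not the project's actual test suite: `rolling_return` is a hypothetical stand-in for one engineered feature, and the prices are made up so that the expected values are easy to compute by hand.

```python
import unittest

import pandas as pd


def rolling_return(prices, n):
    """Hypothetical feature: percentage price change over the past n periods."""
    return prices.pct_change(periods=n)


class TestRollingReturn(unittest.TestCase):
    def test_matches_hand_calculation(self):
        prices = pd.Series([100.0, 110.0, 99.0, 108.9])
        result = rolling_return(prices, n=1)
        # Hand-calculated: 110/100 - 1, 99/110 - 1, 108.9/99 - 1
        expected = [0.10, -0.10, 0.10]
        # The first row has no prior price, so the feature is undefined there.
        self.assertTrue(pd.isna(result.iloc[0]))
        for got, want in zip(result.iloc[1:], expected):
            self.assertAlmostEqual(got, want, places=10)
```

A test like this can be run with `python -m unittest` from the directory containing the test file.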

In [1]:
import create_models
import crypto_utils as cryp
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.model_selection as model_selection
import time
from xgboost import XGBRegressor

from crypto_utils import fmt_date, print_update
from sklearn.metrics import mean_absolute_error

In [2]:
# Custom output options.

np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)
sns.set_style('white')
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['savefig.bbox'] = 'tight'
plt.rcParams['savefig.pad_inches'] = 0.05
%matplotlib inline

In [3]:
RAND_STATE = 88  # used whenever random seed permitted for consistency

## Construct Design Matrix

We want the construction of the design matrix to be flexible enough that we can easily change which features are included, which cryptocurrency's price return we forecast, etc.

In [4]:
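def get_regression_data(x_cryptos, y_crypto, test_size, params):
    """Returns X_train, X_test, y_train, y_test data to use in the
    regression problem of forecasting the return on y_crypto.
    """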
    design = cryp.DesignMatrix(x_cryptos=x_cryptos, y_crypto=y_crypto, **params)
    X, Y = design.get_data(lag_indicator=True)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, Y, test_size=test_size, random_state=RAND_STATE)
    return X_train, X_test, y_train, y_test

In [5]:
crypto_scope = ['ltc', 'xrp', 'xlm', 'eth', 'btc']

# Store x cryptocurrencies and y crypto (the one we're forecasting)
# in list of tuples.
xy_crypto_pairs = []
for y_crypto in crypto_scope:
    x_cryptos = [c for c in crypto_scope if c != y_crypto]
    xy_crypto_pairs.append((x_cryptos, y_crypto))

# Modeling: Regression

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
N_CROSSVAL = 3
TEST_SIZE = 0.2

## Baseline Model

In [8]:
def evaluate_baseline_model(x_cryptos, y_crypto, params):
    """Return MAE on test set."""
    X_train, X_test, y_train, y_test = get_regression_data(x_cryptos, 
                                                           y_crypto, TEST_SIZE,
                                                           params)
    lr = LinearRegression().fit(X_train, y_train)
    return mean_absolute_error(y_test, lr.predict(X_test))

### Determine optimal rolling window for measuring changes in price and volume

Ultimately we want to determine which `n_rolling_volume`, `n_rolling_price` and `n_std_window` to use going forward, as it will influence our more advanced features.
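
To make the three window parameters concrete, here is a rough sketch of the kind of features they could control. The actual definitions live in `crypto_utils.DesignMatrix` and are not shown in this notebook, so the function `rolling_features`, the column names `price` and `volume`, and the exact formulas are assumptions for illustration only.

```python
import pandas as pd


def rolling_features(df, n_rolling_price, n_rolling_volume, n_std_window):
    """Illustrative features only; not the DesignMatrix implementation."""
    out = pd.DataFrame(index=df.index)
    # Percentage change in price over the last n_rolling_price periods.
    out['price_chg'] = df['price'].pct_change(n_rolling_price)
    # Percentage change in volume over the last n_rolling_volume periods.
    out['volume_chg'] = df['volume'].pct_change(n_rolling_volume)
    # Volatility: std of one-period returns over a trailing window.
    out['price_std'] = df['price'].pct_change().rolling(n_std_window).std()
    return out


# Made-up data, chosen so the first few values are easy to check by hand.
data = pd.DataFrame({'price': [10.0, 11.0, 12.1, 11.0, 12.0, 12.6],
                     'volume': [100.0, 120.0, 90.0, 90.0, 135.0, 150.0]})
feats = rolling_features(data, n_rolling_price=1, n_rolling_volume=2,
                         n_std_window=3)
print(feats)
```

Longer windows leave more leading rows undefined, which is one reason the choice of window affects every feature built on top of these.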

In [9]:
def find_optimal_rolling_periods():
    """Iterates over many different rolling period windows and evaluates 
    MAE on test set.
    
    Notes: Takes ~18min to run.
    """
    df_results = pd.DataFrame(columns=['y', 'mae', 'n_rolling_price', 
                                       'n_rolling_volume', 'n_std_window'])

    params = {'n_rolling_price':None, 'n_rolling_volume':None,
              'x_assets':[], 'n_std_window':None}

    n_rolling_prices = range(1, 5)
    n_rolling_volumes = range(1, 5)
    n_std_windows = range(5, 60, 5)
    
    combo_total = len(n_rolling_prices) * len(n_rolling_volumes) * len(n_std_windows)
    combo_count = 0
    
    t0 = time.time()
    for n_price in n_rolling_prices:
        for n_vol in n_rolling_volumes:
            for n_std in n_std_windows:
                combo_count += 1
                print_update('Trying param combination {}/{}...'.format(
                    combo_count, combo_total))
                params['n_rolling_price'] = n_price
                params['n_rolling_volume'] = n_vol
                params['n_std_window'] = n_std
                new_row = {'n_rolling_price': n_price,
                           'n_rolling_volume': n_vol,
                           'n_std_window': n_std}
                for x_cryps, y_cryp in xy_crypto_pairs:
                    new_row['y'] = y_cryp
                    new_row['mae'] = evaluate_baseline_model(x_cryps, y_cryp, 
                                                             params)
                    # DataFrame.append was removed in pandas 2.0; use concat.
                    df_results = pd.concat(
                        [df_results, pd.DataFrame([new_row])],
                        ignore_index=True)
    print_update('Finished all parameter combinations in {:.2f} seconds.'.format(
        time.time() - t0))
    
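    # Compute the average MAE for each window tuple across all cryptos.
    # Select 'mae' explicitly so the string column 'y' is not averaged.
    avg_results = df_results.astype({'mae': float}).groupby(
        ['n_rolling_price', 'n_rolling_volume', 'n_std_window'])[['mae']].mean()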
    return df_results, avg_results

After iterating over many rolling window options in `find_optimal_rolling_periods()`, we found the following parameters to be optimal:
- `n_rolling_price`: 1
- `n_rolling_volume`: 1
- `n_std_window`: 10
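
As a sketch of how such an optimum can be read off the results, the snippet below averages MAE per window tuple and picks the tuple with the lowest average. The MAE values are made up for illustration and are not the project's results; the column names mirror those used in `find_optimal_rolling_periods()`.

```python
import pandas as pd

# Made-up MAE values purely for illustration.
df_results = pd.DataFrame({
    'y': ['btc', 'eth', 'btc', 'eth', 'btc', 'eth'],
    'n_rolling_price': [1, 1, 1, 1, 2, 2],
    'n_rolling_volume': [1, 1, 1, 1, 1, 1],
    'n_std_window': [5, 5, 10, 10, 10, 10],
    'mae': [0.030, 0.050, 0.020, 0.040, 0.025, 0.045],
})

window_cols = ['n_rolling_price', 'n_rolling_volume', 'n_std_window']
# Average MAE across cryptocurrencies for each window tuple.
avg_mae = df_results.groupby(window_cols)['mae'].mean()
# idxmin returns the index label (here a tuple) of the smallest average.
best_windows = avg_mae.idxmin()
best_params = dict(zip(window_cols, best_windows))
print(best_params)
```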

In [10]:
OPTIMAL_PARAMS = {'n_rolling_price':1, 'n_rolling_volume':1,
                  'x_assets':[], 'n_std_window':10}

## Try Additional Regression Models

We now experiment with different models, using the optimal time windows found above with our baseline model.

Run all models using each cryptocurrency as our target ($y$) value.
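
The internals of `create_models.regression_models` are not shown in this notebook. The sketch below is only a guess at the general pattern suggested by the result tables (one tuned hyperparameter per model, scored by test-set MAE) and by `N_CROSSVAL = 3`: it tunes a single hyperparameter by 3-fold cross-validation on the training set and then reports test MAE. The helper names, the NumPy ridge implementation, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without an intercept (illustration only)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def tune_and_score(X_train, y_train, X_test, y_test, alphas, n_folds=3):
    """Pick alpha by k-fold CV MAE on the training set, then score on test."""
    all_idx = np.arange(len(y_train))
    folds = np.array_split(all_idx, n_folds)
    cv_mae = {}
    for alpha in alphas:
        errs = []
        for val_idx in folds:
            trn_idx = np.setdiff1d(all_idx, val_idx)
            w = ridge_fit(X_train[trn_idx], y_train[trn_idx], alpha)
            errs.append(mae(y_train[val_idx], X_train[val_idx] @ w))
        cv_mae[alpha] = np.mean(errs)
    best_alpha = min(cv_mae, key=cv_mae.get)
    # Refit on the full training set with the selected hyperparameter.
    w = ridge_fit(X_train, y_train, best_alpha)
    return {'model': 'Ridge', 'score': mae(y_test, X_test @ w),
            'hyperparam': 'alpha', 'value': best_alpha}


# Synthetic data standing in for the design matrix.
rng = np.random.default_rng(88)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.05, size=200)

row = tune_and_score(X[:160], y[:160], X[160:], y[160:],
                     alphas=[0.1, 1.0, 10.0])
summary = pd.DataFrame([row]).set_index('model')
print(summary)
```

Repeating `tune_and_score` for each model family and concatenating the rows would give a summary with the same `score`, `hyperparam`, and `value` columns as the tables in this section.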

In [16]:
regression_results = {}  # store DataFrame for each target (crypto).

for x_cryptos, y_crypto in xy_crypto_pairs:
    X_train, X_test, y_train, y_test = get_regression_data(
        x_cryptos, y_crypto, TEST_SIZE, OPTIMAL_PARAMS)
    regression_results[y_crypto] = create_models.regression_models(
        X_train, y_train, X_test, y_test, scoring=mean_absolute_error)
    regression_results[y_crypto].sort_values(
        'score', ascending=True, inplace=True)

Finished evaluating regression models.           

In [17]:
for y_crypto, df_results in regression_results.items():
    print('Test Set MAE for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set MAE for ltc


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0226 | alpha | 0.0051 |
| ElasticNet | 0.0226 | l1_ratio | 0.1 |
| XGBRegressor | 0.0227 | n_estimators | 23.0 |
| Ridge | 0.0238 | alpha | 10.0 |
| RandomForest | 0.0296 | n_estimators | 25.0 |



Test Set MAE for xrp


| model | score | hyperparam | value |
|---|---|---|---|
| XGBRegressor | 0.0353 | n_estimators | 66.0 |
| ElasticNet | 0.0379 | l1_ratio | 0.2 |
| Lasso | 0.0379 | alpha | 0.0012 |
| Ridge | 0.039 | alpha | 10.0 |
| RandomForest | 0.0834 | n_estimators | 10.0 |



Test Set MAE for xlm


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0483 | alpha | 0.0122 |
| ElasticNet | 0.0483 | l1_ratio | 0.1 |
| XGBRegressor | 0.0485 | n_estimators | 98.0 |
| Ridge | 0.0507 | alpha | 10.0 |
| RandomForest | 0.0619 | n_estimators | 15.0 |



Test Set MAE for eth


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0491 | alpha | 0.0114 |
| ElasticNet | 0.0491 | l1_ratio | 0.1 |
| XGBRegressor | 0.0491 | n_estimators | 57.0 |
| Ridge | 0.0499 | alpha | 10.0 |
| RandomForest | 0.0534 | n_estimators | 25.0 |



Test Set MAE for btc


| model | score | hyperparam | value |
|---|---|---|---|
| XGBRegressor | 0.0175 | n_estimators | 94.0 |
| Lasso | 0.0176 | alpha | 0.0033 |
| ElasticNet | 0.0176 | l1_ratio | 1.0 |
| Ridge | 0.0189 | alpha | 10.0 |
| RandomForest | 0.0193 | n_estimators | 25.0 |





## News Feature

In [18]:
# Read in nyt news data
news = pd.read_csv('NLP/data/nyt_data_sentiment.csv')
# clean up dates
news['date'] = pd.to_datetime(news['date'])
# group by day and take average sentiment
daily_news = news.groupby('date').mean()
# add news to parameter list
OPTIMAL_PARAMS['add_news'] = True
OPTIMAL_PARAMS['news'] = daily_news

In [19]:
regression_results_news = {}  # store DataFrame for each target (crypto).

for x_cryptos, y_crypto in xy_crypto_pairs:
    X_train, X_test, y_train, y_test = get_regression_data(
        x_cryptos, y_crypto, TEST_SIZE, OPTIMAL_PARAMS)
    regression_results_news[y_crypto] = create_models.regression_models(
        X_train, y_train, X_test, y_test, scoring=mean_absolute_error)
    regression_results_news[y_crypto].sort_values(
        'score', ascending=True, inplace=True)
for y_crypto, df_results in regression_results_news.items():
    print('Test Set MAE for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set MAE for ltc


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0226 | alpha | 0.0051 |
| ElasticNet | 0.0226 | l1_ratio | 0.1 |
| XGBRegressor | 0.0234 | n_estimators | 63.0 |
| Ridge | 0.0238 | alpha | 10.0 |
| RandomForest | 0.0293 | n_estimators | 25.0 |



Test Set MAE for xrp


| model | score | hyperparam | value |
|---|---|---|---|
| XGBRegressor | 0.0356 | n_estimators | 86.0 |
| ElasticNet | 0.0379 | l1_ratio | 0.2 |
| Lasso | 0.0379 | alpha | 0.0012 |
| Ridge | 0.039 | alpha | 10.0 |
| RandomForest | 0.0816 | n_estimators | 15.0 |



Test Set MAE for xlm


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0483 | alpha | 0.0122 |
| ElasticNet | 0.0483 | l1_ratio | 0.1 |
| XGBRegressor | 0.0485 | n_estimators | 58.0 |
| Ridge | 0.0507 | alpha | 10.0 |
| RandomForest | 0.0586 | n_estimators | 20.0 |



Test Set MAE for eth


| model | score | hyperparam | value |
|---|---|---|---|
| Lasso | 0.0491 | alpha | 0.0114 |
| ElasticNet | 0.0491 | l1_ratio | 0.1 |
| XGBRegressor | 0.0492 | n_estimators | 96.0 |
| Ridge | 0.0499 | alpha | 10.0 |
| RandomForest | 0.0532 | n_estimators | 25.0 |



Test Set MAE for btc


| model | score | hyperparam | value |
|---|---|---|---|
| XGBRegressor | 0.0175 | n_estimators | 14.0 |
| Lasso | 0.0176 | alpha | 0.0033 |
| ElasticNet | 0.0176 | l1_ratio | 1.0 |
| Ridge | 0.0189 | alpha | 10.0 |
| RandomForest | 0.021 | n_estimators | 25.0 |





In [46]:
for currency in regression_results_news.keys():
    old_score = regression_results[currency]['score']
    new_score = regression_results_news[currency]['score']
    # Percent change in test MAE after adding the news feature
    # (negative means the error decreased).
    percent_change = (new_score - old_score) / old_score * 100
    print('Test Set MAE % change for {}'.format(currency))
    display(percent_change.rename('% change with new feature').to_frame())
    print()  # for space

Test Set MAE % change for ltc

| model | % change with new feature |
|---|---|
| Lasso | 0.0 |
| ElasticNet | 0.0 |
| XGBRegressor | 3.042 |
| Ridge | 0.0 |
| RandomForest | -1.129 |



Test Set MAE % change for xrp

| model | % change with new feature |
|---|---|
| XGBRegressor | 1.0361 |
| ElasticNet | 0.0 |
| Lasso | 0.0 |
| Ridge | 0.0 |
| RandomForest | -2.0949 |



Test Set MAE % change for xlm

| model | % change with new feature |
|---|---|
| Lasso | 0.0 |
| ElasticNet | 0.0 |
| XGBRegressor | -0.0777 |
| Ridge | 0.0 |
| RandomForest | -5.3086 |



Test Set MAE % change for eth

| model | % change with new feature |
|---|---|
| Lasso | 0.0 |
| ElasticNet | 0.0 |
| XGBRegressor | 0.07 |
| Ridge | 0.0 |
| RandomForest | -0.297 |



Test Set MAE % change for btc

| model | % change with new feature |
|---|---|
| XGBRegressor | 0.0027 |
| Lasso | 0.0 |
| ElasticNet | 0.0 |
| Ridge | 0.0 |
| RandomForest | 8.4862 |
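
The percent-change calculation relies on pandas aligning the two score columns by model name rather than by row position, which matters because the tables are sorted by score and the row order can differ between runs. A small sketch, using the rounded ltc test MAE values reported earlier in this section:

```python
import pandas as pd

# Test MAE for ltc without the news feature (rounded to four decimals).
old_score = pd.Series({'Lasso': 0.0226, 'ElasticNet': 0.0226,
                       'XGBRegressor': 0.0227, 'Ridge': 0.0238,
                       'RandomForest': 0.0296})
# Test MAE for ltc with the news feature, deliberately listed in a
# different order: pandas matches entries by index label, not position.
new_score = pd.Series({'RandomForest': 0.0293, 'Ridge': 0.0238,
                       'XGBRegressor': 0.0234, 'ElasticNet': 0.0226,
                       'Lasso': 0.0226})

percent_change = (new_score - old_score) / old_score * 100
print(percent_change.round(2))
```

Because the inputs here are rounded, the XGBRegressor change comes out near 3.08% rather than the 3.042% shown in the ltc table, which was presumably computed from the unrounded scores.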





# Modeling: Classification

In [50]:
# Remove the news feature for the first set of classification models.
OPTIMAL_PARAMS['add_news'] = False
CLF_THRESH = 0.01  # return threshold separating Buy/Sell from Do Nothing

In [51]:
def get_classification_data(x_cryptos, y_crypto, thresh, test_size, params):
    """Returns X_train, X_test, y_train, y_test data to use in the 
    classification problem.
    
    Args:
        thresh (float): Threshold to use in determining whether an observation 
        is classified as a Buy or Sell (vs. Do Nothing).
    """
    design = cryp.DesignMatrix(x_cryptos, y_crypto, **params)
    X, Y = design.get_data(lag_indicator=True, y_category=True,
                           y_category_thresh=thresh)
    return model_selection.train_test_split(X, Y, test_size=test_size, 
                                            random_state=RAND_STATE)
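
To illustrate how a threshold can turn continuous returns into Buy / Sell / Do Nothing classes, here is a minimal sketch. The actual labeling is done inside `cryp.DesignMatrix.get_data` and is not shown in this notebook; the function `label_returns`, the symmetric use of the threshold, and the integer encoding are assumptions.

```python
import numpy as np


def label_returns(returns, thresh):
    """Hypothetical encoding: 1 = Buy, -1 = Sell, 0 = Do Nothing."""
    returns = np.asarray(returns, dtype=float)
    # Returns within [-thresh, thresh] are treated as too small to act on.
    return np.where(returns > thresh, 1,
                    np.where(returns < -thresh, -1, 0))


labels = label_returns([0.025, -0.004, -0.03, 0.01], thresh=0.01)
print(labels)
```

A larger threshold moves more observations into the Do Nothing class, so the class balance, and therefore the accuracy a trivial classifier can reach, depends directly on `thresh`.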

In [59]:
clf_results = {}

for i, (x_cryptos, y_crypto) in enumerate(xy_crypto_pairs):
    print_update('Evaluating model for {0} ({1}/{2})'.format(
        y_crypto, i+1, len(xy_crypto_pairs)))
    X_train, X_test, y_train, y_test = get_classification_data(
        x_cryptos, y_crypto, CLF_THRESH, TEST_SIZE, OPTIMAL_PARAMS)
    clf_results[y_crypto] = create_models.traditional_models(
        X_train, y_train, X_test, y_test, pos_label=[1])
    clf_results[y_crypto].sort_values('Accuracy', ascending=True, 
                                      inplace=True)

Evaluating model for btc (5/5)                   

In [53]:
for y_crypto, df_results in clf_results.items():
    print('Test Set Directional Accuracy for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set Directional Accuracy for ltc


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| RandomForest | 0.5437 | 0.3431 | 0.1569 |
| LDA | 0.5767 | 0.4161 | 0.2 |
| QDA | 0.517 | 0.4161 | 0.25 |
| ADABoost | 0.5861 | 0.4599 | 0.1538 |
| SVM | 0.5102 | 0.4891 | 0.0 |
| LogReg | 0.4806 | 0.4964 | NaN |
| KNN | 0.4959 | 0.4964 | NaN |



Test Set Directional Accuracy for xrp


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| KNN | 0.4799 | 0.2847 | 0.2456 |
| QDA | 0.4952 | 0.292 | 0.24 |
| LDA | 0.5377 | 0.3431 | 0.3302 |
| RandomForest | 0.5158 | 0.3504 | 0.3125 |
| LogReg | 0.5201 | 0.3577 | 0.2791 |
| ADABoost | 0.4907 | 0.3577 | 0.3009 |
| SVM | 0.4219 | 0.4015 | 0.0 |



Test Set Directional Accuracy for xlm


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| SVM | 0.4371 | 0.219 | 0.0 |
| LogReg | 0.5419 | 0.2409 | 0.3438 |
| QDA | 0.4672 | 0.3358 | 0.4902 |
| RandomForest | 0.4828 | 0.3942 | 0.4016 |
| LDA | 0.5344 | 0.4088 | 0.4167 |
| ADABoost | 0.525 | 0.4161 | 0.4161 |
| KNN | 0.4652 | 0.438 | 0.438 |



Test Set Directional Accuracy for eth


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| SVM | 0.554 | 0.2336 | 0.0 |
| QDA | 0.4715 | 0.2993 | 0.3846 |
| RandomForest | 0.4181 | 0.3504 | 0.3459 |
| KNN | 0.4422 | 0.3796 | 0.3796 |
| LogReg | 0.4856 | 0.3869 | 0.3869 |
| LDA | 0.5148 | 0.3942 | 0.3971 |
| ADABoost | 0.4298 | 0.4015 | 0.4015 |



Test Set Directional Accuracy for btc


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LogReg | 0.4913 | 0.1825 | 0.1825 |
| QDA | 0.5162 | 0.4161 | 0.2667 |
| ADABoost | 0.5025 | 0.4672 | 0.3333 |
| SVM | 0.5158 | 0.4672 | 0.0 |
| KNN | 0.5039 | 0.4745 | 1.0 |
| LDA | 0.5105 | 0.4745 | 0.4091 |
| RandomForest | 0.4456 | 0.4818 | 0.4211 |





## News Feature

In [54]:
OPTIMAL_PARAMS['add_news'] = True

In [56]:
clf_results_news = {}

for i, (x_cryptos, y_crypto) in enumerate(xy_crypto_pairs):
    print_update('Evaluating model for {0} ({1}/{2})'.format(
        y_crypto, i+1, len(xy_crypto_pairs)))
    X_train, X_test, y_train, y_test = get_classification_data(
        x_cryptos, y_crypto, CLF_THRESH, TEST_SIZE, OPTIMAL_PARAMS)
    clf_results_news[y_crypto] = create_models.traditional_models(
        X_train, y_train, X_test, y_test, pos_label=[1])
    clf_results_news[y_crypto].sort_values('Accuracy', ascending=True, 
                                      inplace=True)
for y_crypto, df_results in clf_results_news.items():
    print('Test Set Directional Accuracy for {}'.format(y_crypto))
    display(df_results)
    print()  # for space

Test Set Directional Accuracy for ltc            


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LDA | 0.5767 | 0.4161 | 0.2 |
| QDA | 0.517 | 0.4161 | 0.25 |
| RandomForest | 0.5525 | 0.4161 | 0.186 |
| ADABoost | 0.5861 | 0.4599 | 0.1538 |
| SVM | 0.5102 | 0.4891 | 0.0 |
| LogReg | 0.4817 | 0.4964 | NaN |
| KNN | 0.4959 | 0.4964 | NaN |



Test Set Directional Accuracy for xrp


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| KNN | 0.4799 | 0.2847 | 0.2456 |
| QDA | 0.4952 | 0.292 | 0.24 |
| LDA | 0.5377 | 0.3431 | 0.3302 |
| RandomForest | 0.5319 | 0.3504 | 0.3 |
| LogReg | 0.5201 | 0.3577 | 0.2791 |
| ADABoost | 0.4907 | 0.3577 | 0.3009 |
| SVM | 0.4219 | 0.4015 | 0.0 |



Test Set Directional Accuracy for xlm


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| SVM | 0.4371 | 0.219 | 0.0 |
| LogReg | 0.5419 | 0.2409 | 0.3438 |
| QDA | 0.4672 | 0.3358 | 0.4902 |
| LDA | 0.5344 | 0.4088 | 0.4167 |
| RandomForest | 0.4955 | 0.4161 | 0.4309 |
| ADABoost | 0.525 | 0.4161 | 0.4161 |
| KNN | 0.4652 | 0.438 | 0.438 |



Test Set Directional Accuracy for eth


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LogReg | 0.4856 | 0.2336 | NaN |
| SVM | 0.554 | 0.2336 | 0.0 |
| QDA | 0.4715 | 0.2993 | 0.3846 |
| RandomForest | 0.4689 | 0.3504 | 0.3667 |
| KNN | 0.4422 | 0.3796 | 0.3796 |
| LDA | 0.5148 | 0.3942 | 0.3971 |
| ADABoost | 0.4298 | 0.4015 | 0.4015 |



Test Set Directional Accuracy for btc


| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| LogReg | 0.4913 | 0.1825 | 0.1825 |
| QDA | 0.5162 | 0.4161 | 0.2667 |
| ADABoost | 0.5025 | 0.4672 | 0.3333 |
| SVM | 0.5158 | 0.4672 | 0.0 |
| KNN | 0.5039 | 0.4745 | 1.0 |
| LDA | 0.5105 | 0.4745 | 0.4091 |
| RandomForest | 0.4864 | 0.4818 | 0.4412 |





In [60]:
for currency in clf_results.keys():
    old_score = clf_results[currency]
    new_score = clf_results_news[currency]
    # Percent change in each metric after adding the news feature.
    percent_change = (new_score - old_score) / old_score * 100
    print('Test Set Directional Accuracy % change with new feature for {}'.format(currency))
    display(percent_change)
    print()  # for space

Test Set Directional Accuracy % change with new feature for ltc

| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| ADABoost | 0.0 | 0.0 | 0.0 |
| KNN | 0.0 | 0.0 | NaN |
| LDA | 0.0 | 0.0 | 0.0 |
| LogReg | 0.2177 | 0.0 | NaN |
| QDA | 0.0 | 0.0 | 0.0 |
| RandomForest | 2.5249 | 7.5472 | -4.9096 |
| SVM | 0.0 | 0.0 | NaN |



Test Set Directional Accuracy % change with new feature for xrp

| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| ADABoost | 2.6743 | 6.5217 | -1.1378 |
| KNN | 0.0 | 0.0 | 0.0 |
| LDA | 0.0 | 0.0 | 0.0 |
| LogReg | 0.0 | 0.0 | 0.0 |
| QDA | 0.0 | 0.0 | 0.0 |
| RandomForest | -2.2178 | 2.1277 | -5.0 |
| SVM | 0.0 | 0.0 | NaN |



Test Set Directional Accuracy % change with new feature for xlm

| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| ADABoost | 0.0 | 0.0 | 0.0 |
| KNN | 0.0 | 0.0 | 0.0 |
| LDA | 0.0 | 0.0 | 0.0 |
| LogReg | 0.0 | 0.0 | 0.0 |
| QDA | 0.0 | 0.0 | 0.0 |
| RandomForest | 0.0228 | 5.5556 | 6.8259 |
| SVM | 0.0 | 0.0 | NaN |



Test Set Directional Accuracy % change with new feature for eth

| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| ADABoost | 0.0 | 0.0 | 0.0 |
| KNN | 0.0 | 0.0 | 0.0 |
| LDA | 0.0 | 0.0 | 0.0 |
| LogReg | 0.0 | -39.6226 | NaN |
| QDA | 0.0 | 0.0 | 0.0 |
| RandomForest | 21.1198 | -2.0408 | 0.8333 |
| SVM | 0.0 | 0.0 | NaN |



Test Set Directional Accuracy % change with new feature for btc

| model | AUC | Accuracy | D_Accuracy |
|---|---|---|---|
| ADABoost | 0.0 | 0.0 | 0.0 |
| KNN | 0.0 | 0.0 | 0.0 |
| LDA | 0.0 | 0.0 | 0.0 |
| LogReg | 0.0 | -45.6522 | -17.3465 |
| QDA | 0.0 | 0.0 | 0.0 |
| RandomForest | 4.3959 | 3.125 | 14.7059 |
| SVM | 0.0 | 0.0 | NaN |



