### Project 1: Predicting Bitcoin in February 2018

Code Author: Anh Hong Le

#### Project description:
- Read the data into a Jupyter notebook, using pandas to import it into a data frame.
- Preprocess the data: explore it, address missing data and categorical data (if there is any), and scale it. Justify the type of scaling used in this project.
- Train on your dataset using all the linear regression models you've learned so far. If your model has a scaling parameter(s), use Grid Search to find the best scaling parameter. Use plots and graphs to help you get a better glimpse of the results.
- Then use cross-validation to find the average training and testing scores.
- Your submission should have at least the following regression models: KNN regressor, linear regression, Ridge, Lasso, polynomial regression, and SVM, both simple and with kernels.
- Finally, find the best regressor for this dataset, train your model on the entire dataset using the best parameters, and predict the market price for the test_set.

#### Table of Contents
- Data Exploration and Preprocessing
- Model Selection: KNN Regressor / Linear Regressor / Polynomial / Kernelized SVR
- Predicting Bitcoin Price (Polynomial)

# Data Exploration and Preprocessing

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.feature_selection import f_regression as freg
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [69]:
data = pd.read_csv('bitcoin_dataset.csv')
test = pd.read_csv('test_set.csv')

In [70]:
data.head()

Unnamed: 0,Date,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
0,2/17/2010 0:00,0.0,2043200.0,0.0,0.0,0.0,0.000235,0,1.0,0.0,...,31.781022,0.0,241,244,41240,244,244,65173.13,36500.0,0.0
1,2/18/2010 0:00,0.0,2054650.0,0.0,0.0,0.0,0.000241,0,1.0,0.0,...,154.463801,0.0,234,235,41475,235,235,18911.74,7413.0,0.0
2,2/19/2010 0:00,0.0,2063600.0,0.0,0.0,0.0,0.000228,0,1.0,0.0,...,1278.516635,0.0,185,183,41658,183,183,9749.98,700.0,0.0
3,2/20/2010 0:00,0.0,2074700.0,0.0,0.0,0.0,0.000218,0,1.0,0.0,...,22186.68799,0.0,224,224,41882,224,224,11150.03,50.0,0.0
4,2/21/2010 0:00,0.0,2085400.0,0.0,0.0,0.0,0.000234,0,1.0,0.0,...,689.179876,0.0,218,218,42100,218,218,12266.83,1553.0,0.0


In [71]:
data.tail()


Unnamed: 0,Date,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
2901,1/27/2018 0:00,11524.77667,16830312.5,193966000000.0,763094600.0,153844.0759,1.038548,0,1232.980892,11.6,...,1.778601,126.855696,541699,193578,295802277,188058,126082,1363301.068,119799.4611,1380662000.0
2902,1/28/2018 0:00,11765.71,16832287.5,198044000000.0,738104200.0,154006.9753,1.031009,0,1350.924051,11.95,...,1.302242,117.430262,492738,213446,296015723,205967,137919,3128906.096,163590.5694,1924759000.0
2903,1/29/2018 0:00,11212.655,16834137.5,188755000000.0,611119700.0,154157.6651,1.018174,0,1568.756757,12.275,...,1.243012,96.382352,532630,232176,296247899,225983,155772,1941048.853,160557.7065,1800278000.0
2904,1/30/2018 0:00,10184.06167,16836225.0,171461000000.0,1266284000.0,154322.579,0.987509,0,1416.820359,11.075,...,1.301143,96.749249,531440,236609,296484508,230310,158259,2359671.266,172755.8071,1759356000.0
2905,1/31/2018 0:00,10125.01333,16837687.5,170482000000.0,933239800.0,154444.5903,1.042831,0,1745.948718,15.675,...,1.016284,80.529148,481100,204276,296688784,197264,141543,1785708.486,159867.3306,1618659000.0


In [72]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906 entries, 0 to 2905
Data columns (total 24 columns):
Date                                                   2906 non-null object
btc_market_price                                       2906 non-null float64
btc_total_bitcoins                                     2879 non-null float64
btc_market_cap                                         2906 non-null float64
btc_trade_volume                                       2885 non-null float64
btc_blocks_size                                        2877 non-null float64
btc_avg_block_size                                     2906 non-null float64
btc_n_orphaned_blocks                                  2906 non-null int64
btc_n_transactions_per_block                           2906 non-null float64
btc_median_confirmation_time                           2894 non-null float64
btc_hash_rate                                          2906 non-null float64
btc_difficulty                                   

In [73]:
data.describe()


Unnamed: 0,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,btc_hash_rate,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
count,2906.0,2879.0,2906.0,2885.0,2877.0,2906.0,2906.0,2906.0,2894.0,2906.0,...,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0
mean,839.104218,11511380.0,13442550000.0,73983810.0,35505.502848,0.350366,0.364074,671.673651,7.501113,1244070.0,...,66.747821,14.639125,193786.1,102081.138334,68445580.0,94348.852374,63140.320028,1566216.0,203647.5,202433800.0
std,2304.972497,4200024.0,38661500000.0,292422800.0,43618.633821,0.353168,0.842259,689.561322,4.974549,2924141.0,...,1761.894646,20.536083,208914.6,103896.92935,82853410.0,103966.111763,69687.052174,2278910.0,268278.1,580051300.0
min,0.0,2043200.0,0.0,0.0,0.0,0.000216,0.0,1.0,0.0,2.25e-05,...,0.136531,0.0,110.0,118.0,41240.0,118.0,118.0,6150.0,7.0,0.0
25%,6.653465,8485300.0,53630810.0,291645.6,781.0,0.024177,0.0,54.0,6.066667,11.6088,...,1.181945,4.15647,16754.75,8025.25,2413376.0,6813.5,6765.5,490171.2,96003.25,958168.0
50%,235.13,12431150.0,3346869000.0,10014140.0,15183.0,0.196022,0.0,375.0,7.916667,21761.89,...,2.493564,7.82243,130445.0,62337.0,32552710.0,53483.0,35283.5,1105205.0,178468.5,37425760.0
75%,594.191164,15200510.0,8075525000.0,28340380.0,58293.0,0.676065,0.0,1232.995223,10.208333,1035363.0,...,5.915591,14.800589,360376.5,190471.25,108066300.0,185901.75,113793.25,2031654.0,258804.6,131249900.0
max,19498.68333,16837690.0,326525000000.0,5352016000.0,154444.5903,1.110327,7.0,2722.625,47.733333,21609750.0,...,88571.42857,161.686071,1072861.0,490644.0,296688800.0,470650.0,318896.0,45992220.0,5825066.0,5760245000.0


### Finding missing data and imputation ('pad' method)

In [74]:
null_columns=data.columns[data.isnull().any()] 
data[null_columns].isnull().sum()
null_columns

Index(['btc_total_bitcoins', 'btc_trade_volume', 'btc_blocks_size',
       'btc_median_confirmation_time', 'btc_difficulty',
       'btc_transaction_fees'],
      dtype='object')

In [75]:
data_pad = data.ffill()  # forward fill ('pad'): copy the last valid value into each gap
data_pad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906 entries, 0 to 2905
Data columns (total 24 columns):
Date                                                   2906 non-null object
btc_market_price                                       2906 non-null float64
btc_total_bitcoins                                     2906 non-null float64
btc_market_cap                                         2906 non-null float64
btc_trade_volume                                       2906 non-null float64
btc_blocks_size                                        2906 non-null float64
btc_avg_block_size                                     2906 non-null float64
btc_n_orphaned_blocks                                  2906 non-null int64
btc_n_transactions_per_block                           2906 non-null float64
btc_median_confirmation_time                           2906 non-null float64
btc_hash_rate                                          2906 non-null float64
btc_difficulty                                   
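To make the 'pad' method concrete, here is a minimal sketch on a made-up series (the values are hypothetical and not taken from the dataset). Padding copies the last observed value forward into each gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with gaps; not taken from bitcoin_dataset.csv
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill ("pad"): each gap takes the last value seen before it
filled = s.ffill()

print(filled.tolist())
```

The leading gap stays NaN because there is no earlier value to copy, so it is worth checking the first rows of a padded column.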

### Filtering Data (since 2013)
Bitcoin activity became much more significant from 2013 onward, so I use only the later part of the timeline to find clearer patterns and make better predictions.

In [76]:
data['Date'] = pd.to_datetime(data.Date) 
plt.plot(data['Date'],data['btc_market_price'])
plt.gcf().autofmt_xdate()
plt.show()


In [77]:
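# Note: only data['Date'] was converted with pd.to_datetime above; data_pad['Date'] still holds
# the original strings, so the comparisons below are string comparisons, not date comparisons.
data_pad1 = data_pad[data_pad['Date'] >= '2013-01-01']  # filtering data
X = data_pad1.loc[:, 'btc_total_bitcoins':'btc_estimated_transaction_volume_usd']
y = data_pad['btc_market_price'][data_pad['Date'] >= '2013-01-01']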
print(X.shape)
print(y.shape)

(1712, 22)
(1712,)
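Filtering by date behaves differently depending on whether the column holds text or parsed timestamps. The sketch below uses a hypothetical four-row frame (made-up prices, same `M/D/YYYY H:MM` text format as the `Date` column) to show the difference. It is an illustration only, not a rerun of the cell above.

```python
import pandas as pd

# Hypothetical rows in the same "M/D/YYYY H:MM" text format as the Date column
df = pd.DataFrame({
    "Date": ["3/1/2010 0:00", "12/31/2012 0:00", "1/1/2013 0:00", "9/30/2017 0:00"],
    "price": [0.0, 13.5, 13.3, 4300.0],
})

# Comparing raw strings goes character by character, not chronologically
as_text = df[df["Date"] >= "2013-01-01"]

# Parse first, then compare against a real timestamp
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %H:%M")
as_dates = df[df["Date"] >= pd.Timestamp("2013-01-01")]

print(as_text["price"].tolist())
print(as_dates["price"].tolist())
```

Here the text comparison keeps the 2010 row and drops the 1/1/2013 row, while the parsed comparison keeps exactly the rows dated 2013 or later.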


## Feature Selection (9 variables)
Using an F-test, I reduce the 22 original features to the 9 highest-scoring ones, a little under half. The purpose is to compare the different algorithms on features with significant results rather than to correct for the impact of outliers.

In [78]:
import sklearn.feature_selection 
select = sklearn.feature_selection.SelectKBest(score_func=freg, k=9)
selected_features= select.fit(X,y)
indices_selected= selected_features.get_support(indices=True)
colnames_selected= [X.columns[i] for i in indices_selected]
print(select.scores_) # F-test score
print(select.pvalues_) #P-values

[7.84473252e+02 9.44226409e+05 4.19031798e+03 4.03898394e+03
 1.59309367e+03 2.63404807e-01 1.51902407e+03 4.22823454e+02
 1.39796690e+04 1.61100101e+04 2.55839097e+04 2.47013939e+03
 6.77137287e-01 6.98936171e+02 2.20781312e+03 1.60870075e+03
 4.20875117e+03 1.58776156e+03 2.03378720e+03 4.36279807e+01
 1.47387286e+02 1.15674314e+04]
[2.14182548e-142 0.00000000e+000 0.00000000e+000 0.00000000e+000
 9.50079287e-247 6.07856895e-001 2.53848560e-238 3.89928181e-084
 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 4.10688971e-001 2.01605811e-129 3.70697662e-310 1.68387496e-248
 0.00000000e+000 3.78474927e-246 2.80740765e-293 5.28658697e-011
 1.35759020e-032 0.00000000e+000]
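The raw score arrays are hard to read because they carry no feature names. A small sketch of one way to label them follows. It uses synthetic data and hypothetical column names (`a`, `b`, `noise`), so the numbers it prints are unrelated to the Bitcoin features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only "a" and "b" drive the target; "noise" does not
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "noise"])
y_demo = 3.0 * X_demo["a"] + 1.5 * X_demo["b"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Pair each F-score with its column name and sort from strongest to weakest
f_scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = list(X_demo.columns[selector.get_support()])

print(f_scores)
print(kept)
```

The same pattern should apply to the notebook's `select` object and `X.columns`, assuming the selector was fitted on that frame.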


In [79]:
X = X[colnames_selected]
print(colnames_selected)
print(X.shape)

['btc_market_cap', 'btc_trade_volume', 'btc_blocks_size', 'btc_hash_rate', 'btc_difficulty', 'btc_miners_revenue', 'btc_transaction_fees', 'btc_n_transactions_total', 'btc_estimated_transaction_volume_usd']
(1712, 9)


In [80]:
print(X.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1712 entries, 12 to 2782
Data columns (total 9 columns):
btc_market_cap                          1712 non-null float64
btc_trade_volume                        1712 non-null float64
btc_blocks_size                         1712 non-null float64
btc_hash_rate                           1712 non-null float64
btc_difficulty                          1712 non-null float64
btc_miners_revenue                      1712 non-null float64
btc_transaction_fees                    1712 non-null float64
btc_n_transactions_total                1712 non-null int64
btc_estimated_transaction_volume_usd    1712 non-null float64
dtypes: float64(8), int64(1)
memory usage: 133.8 KB
None


### Choosing one variable for plotting parameters on sample data (btc_trade_volume)


In [81]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
plt.figure(figsize=(5,4))
plt.scatter(xtemp, y, marker= 'o', s=50, alpha=0.8)
plt.title('Bitcoin price vs. trade volume (standardized)')
plt.xlabel('Bitcoin trade volume')
plt.ylabel('Bitcoin price')
plt.show()



### Data Standardization
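I chose standardization because it puts all features on a comparable scale (zero mean, unit variance), which matters for the distance-based and regularized models used below (KNN, Ridge, Lasso, SVR), and because it may also be useful for feature selection and for interpreting the results.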

In [82]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
mat1 = sc.fit_transform(X)
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(mat1,y,
                                                   random_state = 10)
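One caveat: the cell above fits the scaler on all rows before splitting, so the test rows contribute to the mean and variance used for scaling. A possible alternative, sketched here on synthetic data rather than the Bitcoin features, wraps the scaler and the model in a `Pipeline` so the scaling statistics come from the training rows only. The names `X_demo`, `X_tr`, and `model` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; the two features have very different scales
rng = np.random.RandomState(10)
X_demo = rng.normal(size=(300, 2)) * np.array([1.0, 1000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=10)

# The scaler is fitted inside the pipeline, on the training rows only
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X_tr, y_tr)

print("test R^2: {:.3f}".format(model.score(X_te, y_te)))
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refitted on each training fold.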

# KNN REGRESSOR

### Gridsearch and Cross-validation

In [83]:
from sklearn.model_selection import GridSearchCV
param = {'n_neighbors':[1, 3, 7, 15, 55]} #create dict of parameters

grid_search_knn = GridSearchCV(KNeighborsRegressor(),param, cv=10) # each parameter has a cv of 10 folds.
grid_search_knn.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search_knn.best_params_)) 
print("Best cross-validation score: {:.2f}".format(grid_search_knn.best_score_)) 
print("Best estimator:\n{}".format(grid_search_knn.best_estimator_))

Best parameters: {'n_neighbors': 3}
Best cross-validation score: 0.99
Best estimator:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=3, p=2,
          weights='uniform')


### Training/Testing Scores

In [84]:
results = pd.DataFrame(grid_search_knn.cv_results_)
scores = np.array(results.mean_test_score)
train_score = grid_search_knn.score(X_train,y_train)
test_score = grid_search_knn.score(X_test,y_test)
print('KNN score( training): {:.6f}'
     .format(train_score))
print('KNN score (testing): {:.6f}'
     .format(test_score))


KNN score( training): 0.996844
KNN score (testing): 0.989560




## Sample Plots (KNN Regressor)
Plotting k-NN regression on the sample data for different values of K (using the variable 'btc_trade_volume' mentioned above)

In [85]:
fig, subaxes = plt.subplots(5,1, figsize=(5,15))
X_predict_input = np.linspace(-2, 4.25, 500).reshape(-1,1)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)

for thisaxis, K in zip(subaxes, [1, 3, 7, 15, 55]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train_temp, y_train_temp)
    y_predict_output = knnreg.predict(X_predict_input)
    train_temp_score = knnreg.score(X_train_temp, y_train_temp)
    test_temp_score = knnreg.score(X_test_temp, y_test_temp)
    thisaxis.set_xlim([-2.5, 4.5])
    thisaxis.plot(X_predict_input, y_predict_output)
    thisaxis.plot(X_train_temp, y_train_temp, 'o', alpha=0.9, label='train_temp')
    thisaxis.plot(X_test_temp, y_test_temp, '^', alpha=0.9, label='test_temp')
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN Regression (K={})\n\
train_temp $R^2 = {:.6f}$,  test_temp $R^2 = {:.6f}$'
                      .format(K, train_temp_score, test_temp_score))
    thisaxis.legend()
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)


# LINEAR REGRESSION

### Cross-validation in linear regression

In [86]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
scores_train_linreg = cross_val_score(linreg, X_train, y_train, cv = 10) 
print("Cross validation scores: {}".format(scores_train_linreg))
print("Average cross-validation score: {:.6f}".format(scores_train_linreg.mean()))

Cross validation scores: [0.99974855 0.99976578 0.99973001 0.99968014 0.99983073 0.99975066
 0.9997786  0.99981991 0.99985177 0.99966224]
Average cross-validation score: 0.999762
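`cross_val_score` returns only the held-out score for each fold. To report the average training score as well, one option is `cross_validate` with `return_train_score=True`. The sketch below uses a synthetic regression problem in place of `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem standing in for X_train / y_train
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# return_train_score=True adds the per-fold training scores to the result dict
cv = cross_validate(LinearRegression(), X_demo, y_demo, cv=10, return_train_score=True)

mean_train = cv["train_score"].mean()
mean_test = cv["test_score"].mean()
print("mean train R^2: {:.4f}, mean test R^2: {:.4f}".format(mean_train, mean_test))
```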


### Simple linear regression

In [87]:
linreg= linreg.fit(X_train,y_train) 
print('linear model intercept: {}'
     .format(linreg.intercept_))
print('linear model coeff:\n{}'
     .format(linreg.coef_))
print('R-squared score (training): {:.6f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.6f}'
     .format(linreg.score(X_test, y_test)))



linear model intercept: 491.17624584839456
linear model coeff:
[ 817.3039598    -2.36961754 -747.45784386  -78.86297925   48.47033405
   81.22378577  -13.11200998  772.12349955    0.81861088]
R-squared score (training): 0.999773
R-squared score (test): 0.999784


### Sample Plot (Simple Linear Regression)
Example plot predicting the bitcoin price (using 'btc_trade_volume')

In [88]:
from sklearn.linear_model import LinearRegression 

xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)
linreg1 = LinearRegression().fit(X_train_temp, y_train_temp)

plt.figure(figsize=(7,4))
plt.scatter(xtemp, y, marker= 'o', s=70, alpha=0.5)
plt.plot(xtemp, linreg1.coef_ * xtemp + linreg1.intercept_, 'r-')
plt.title('Least-squares linear regression')
plt.xlabel('Bitcoin trade volume')
plt.ylabel('Bitcoin Price')
plt.show()



# Ridge linear regression

### Grid Search & Cross-validation

In [89]:
from sklearn.linear_model import Ridge

alpha = {'alpha':[0, 1, 10, 20, 50, 100, 1000]} #create dict of parameters

grid_search_ridge = GridSearchCV(Ridge(),alpha, cv=10) # each parameter has a cv of 10 folds. Try max_iter if possible.
grid_search_ridge.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search_ridge.best_params_))
print("Best cross-validation score: {:.6f}".format(grid_search_ridge.best_score_)) 
print("Best estimator:\n{}".format(grid_search_ridge.best_estimator_))


Best parameters: {'alpha': 0}
Best cross-validation score: 0.999762
Best estimator:
Ridge(alpha=0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
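A best value of `alpha = 0` means the grid search preferred no penalty at all, in which case Ridge reduces to ordinary least squares. That would explain why the cross-validation score matches the plain linear regression above. The sketch below illustrates the equivalence on synthetic data, and also shows the shrinking effect of a large `alpha`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, well-conditioned regression problem (not the Bitcoin features)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X_demo, y_demo)
ridge0 = Ridge(alpha=0.0).fit(X_demo, y_demo)      # no penalty
ridge100 = Ridge(alpha=100.0).fit(X_demo, y_demo)  # strong penalty

# With alpha=0 the coefficients agree with least squares up to numerical error
same_as_ols = np.allclose(ols.coef_, ridge0.coef_, atol=1e-6)
# A positive alpha pulls the coefficient vector toward zero
shrunk = np.linalg.norm(ridge100.coef_) < np.linalg.norm(ols.coef_)

print(same_as_ols, shrunk)
```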


### Training/Testing Scores

In [90]:
results = pd.DataFrame(grid_search_ridge.cv_results_)
scores_ridge = np.array(results.mean_test_score)
train_score = grid_search_ridge.score(X_train,y_train)
test_score = grid_search_ridge.score(X_test,y_test)
print('R-squared score (training): {:.6f}'
     .format(train_score))
print('R-squared score (test): {:.6f}'
     .format(test_score))



R-squared score (training): 0.999773
R-squared score (test): 0.999784




# Lasso linear regression

### Lasso GridSearchCV & Training/Testing scores

In [91]:
from sklearn.linear_model import Lasso

alpha = {'alpha':[0.5, 1, 2, 3, 5, 10, 20, 50]} #create dict of parameters

grid_search_Lasso = GridSearchCV(Lasso(),alpha, cv=10) # each parameter has a cv of 10 folds. Try max_iter if possible.
grid_search_Lasso.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search_Lasso.best_params_)) # The parameter setting that has the highest score
print("Best cross-validation score: {:.6f}".format(grid_search_Lasso.best_score_)) 
print("Best estimator:\n{}".format(grid_search_Lasso.best_estimator_))

Best parameters: {'alpha': 0.5}
Best cross-validation score: 0.999464
Best estimator:
Lasso(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)


In [92]:
results = pd.DataFrame(grid_search_Lasso.cv_results_)
scores_Lasso = np.array(results.mean_test_score)
train_score = grid_search_Lasso.score(X_train,y_train)
test_score = grid_search_Lasso.score(X_test,y_test)
print('R-squared score (training): {:.6f}'
     .format(train_score))
print('R-squared score (test): {:.6f}'
     .format(test_score))


R-squared score (training): 0.999486
R-squared score (test): 0.999547




### Sample plot with different alpha (Lasso)

In [93]:
fig, subaxes = plt.subplots(4,1, figsize=(5,10))
xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)
for this_alpha, subplot in zip([0.5, 5, 10, 50], subaxes):
    lasso =Lasso(alpha=this_alpha).fit(X_train_temp, y_train_temp)
    y_predict_output = lasso.predict(X_test_temp)
    subplot.plot(X_test_temp, y_predict_output)
    subplot.plot(X_train_temp, y_train_temp, 'o', alpha=0.9, label='train_temp')
    subplot.plot(X_test_temp, y_test_temp, '^', alpha=0.9, label='test_temp')
    subplot.set_title('alpha = {:.2f}'.format(this_alpha))
    #subplot.set_xlim([-1, 10])
    subplot.legend()
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)



# Polynomial regression

### Transform features to quadratic form

In [94]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(mat1)

X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state = 10)
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.6f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.6f}\n'
     .format(linreg.score(X_test, y_test)))


(poly deg 2) linear model coeff (w):
[-1.01508354e-09  1.01925222e+03  7.35329736e+00  1.21097974e+02
 -7.07657910e+01 -2.24487039e+01  4.79067507e+01  2.07232582e+00
 -1.47822004e+02 -2.49639427e+01  5.61323902e+01  3.73327257e+00
 -2.64992576e+03  5.79069030e+01 -1.62118208e+02 -5.67033063e+01
 -8.83668923e+00  2.66904224e+03 -2.68312692e+01  2.07457968e-02
  7.02929092e+00  4.40559864e+00 -6.35502484e+00 -3.38758372e+00
 -3.26790432e-01 -5.75357277e+00 -1.50476658e+00 -1.74640419e+03
 -8.84496841e+02  2.72847012e+03  1.21876252e+03 -1.60482702e+02
  3.43241638e+03 -4.55659104e+02  1.29128645e+01 -1.21003356e+02
 -2.25104872e+01 -8.80386939e+00  1.01576544e+03 -2.44957547e+01
  1.10434095e+02  1.24225327e+02  5.27561524e+00 -2.78938235e+03
  2.00063207e+01  9.00507173e+00  9.92973316e+00 -1.34953908e+03
  2.80969132e+01 -1.78625212e+00  1.71229663e+02 -4.46361474e+00
 -1.70701042e+03  4.80390214e+02 -1.17164491e+00]
(poly deg 2) linear model intercept (b): 519.480
(poly deg 2) R-squa

### Polynomial and Ridge

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                   random_state = 10)
linreg_ridge = Ridge().fit(X_train, y_train)

# Report the coefficients and scores of the fitted Ridge model
print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg_ridge.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg_ridge.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.6f}'
     .format(linreg_ridge.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.6f}'
     .format(linreg_ridge.score(X_test, y_test)))

### Polynomial and Lasso

In [96]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                   random_state = 10)
linreg = Lasso().fit(X_train, y_train)

print('(poly deg 2 + Lasso) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + Lasso) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + Lasso) R-squared score (training): {:.6f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + Lasso) R-squared score (test): {:.6f}'
     .format(linreg.score(X_test, y_test)))

(poly deg 2 + Lasso) linear model coeff (w):
[ 0.00000000e+00  7.88910032e+02  0.00000000e+00 -5.52120910e+00
 -0.00000000e+00 -0.00000000e+00  1.10532072e+02  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  5.52563301e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  1.22512872e-02
  0.00000000e+00  0.00000000e+00  0.00000000e+00 -0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.67050125e+01  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -8.57013721e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  8.34477692e+00 -0.00000000e+00  4.42876253e-01  0.00000000e+00
  4.90179774e-01 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00]
(poly deg 2 + Lasso) linear model intercept (b): 497.613
(po
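Most of the Lasso coefficients printed above are exactly zero, which is the L1 penalty acting as a feature selector on the expanded polynomial terms. A quick way to quantify this is to count the non-zero entries. The sketch below does so on synthetic data where only one input matters; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends on the first input only
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Degree-2 expansion of 3 inputs gives 10 columns (bias, 3 linear, 6 quadratic)
X_demo_poly = PolynomialFeatures(degree=2).fit_transform(X_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo_poly, y_demo)

# Count the coefficients the L1 penalty did not drive to zero
n_nonzero = int(np.sum(lasso_demo.coef_ != 0))
print("{} of {} coefficients are non-zero".format(n_nonzero, lasso_demo.coef_.size))
```

Applied to the notebook's model, the same count on `linreg.coef_` would show how many of the polynomial terms the Lasso fit actually uses.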

# Support Vector regression (kernelized)

## Simple linear SVR

In [97]:
X_train, X_test, y_train, y_test = train_test_split(mat1,y, random_state = 10)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
param_svr = {'C':[1, 10, 50, 100],
            'epsilon':[0.5,1,5,10]} 
grid_search_svr = GridSearchCV(SVR(kernel='linear'),param_svr, cv=5) # each parameter has a cv of 5 folds. Try max_iter if possible.
grid_search_svr.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search_svr.best_params_)) 
print("Best cross-validation score: {:.6f}".format(grid_search_svr.best_score_)) 
print("Best estimator:\n{}".format(grid_search_svr.best_estimator_))

Best parameters: {'C': 100, 'epsilon': 5}
Best cross-validation score: 0.999643
Best estimator:
SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=5, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


In [98]:
results = pd.DataFrame(grid_search_svr.cv_results_)
scores_svr = np.array(results.mean_test_score)
train_score = grid_search_svr.score(X_train,y_train)
test_score = grid_search_svr.score(X_test,y_test)
print(train_score, test_score)

0.9996797268225602 0.9997061474781955




### Sample plots of C & Epsilon in linear SVR

In [99]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

fig, subaxes = plt.subplots(4, 4, figsize=(18, 10), dpi=50)
xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)

for this_epsilon, this_axis in zip([0.5,1,5,10], subaxes):
    for this_C, subplot in zip([1, 10, 50, 100], this_axis):
        clf = SVR(kernel = 'linear', epsilon = this_epsilon,
                 C = this_C).fit(X_train_temp, y_train_temp)
        y_predict_output = clf.predict(X_test_temp)
        subplot.plot(X_test_temp, y_predict_output)
        subplot.plot(X_train_temp, y_train_temp, 'o', alpha=0.9, label='train_temp')
        subplot.plot(X_test_temp, y_test_temp, '^', alpha=0.9, label='test_temp')
        subplot.set_title('epsilon = {:.2f}, C = {:.2f}'.format(this_epsilon, this_C))
        subplot.set_xlim([-1, 10])
        subplot.legend()
        plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)



## Kernelized SVR (RBF)

In [100]:
X_train, X_test, y_train, y_test = train_test_split(mat1,y, random_state = 10)
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
param_rbf = {'C':[ 10, 50, 100],
             'gamma':[0.1,1,10],
            'epsilon':[1,5,10]} 

grid_search_rbf = GridSearchCV(SVR(kernel='rbf'),param_rbf, cv=5) # each parameter has a cv of 5 folds. Try max_iter if possible.
grid_search_rbf.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search_rbf.best_params_)) # The parameter setting that has the highest score
print("Best cross-validation score: {:.6f}".format(grid_search_rbf.best_score_)) 
print("Best estimator:\n{}".format(grid_search_rbf.best_estimator_))

Best parameters: {'C': 100, 'epsilon': 10, 'gamma': 0.1}
Best cross-validation score: 0.948515
Best estimator:
SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=10, gamma=0.1,
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


In [101]:
results = pd.DataFrame(grid_search_rbf.cv_results_)
scores_rbf = np.array(results.mean_test_score)
train_score = grid_search_rbf.score(X_train,y_train)
test_score = grid_search_rbf.score(X_test,y_test)
print(train_score, test_score)

0.9667566748340779 0.9717546818642124




### Sample plots of C, Gamma, & Epsilon in Kernelized RBF Regression

#### Gamma = 0.1

In [102]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

fig, subaxes = plt.subplots(2, 2, figsize=(15, 8), dpi=50)
xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)

for this_epsilon, this_axis in zip([1,10], subaxes):
    for this_C, subplot in zip([ 10, 100], this_axis):
        gamma= 0.1
        rbf = SVR(kernel = 'rbf', gamma = gamma, epsilon = this_epsilon,
                 C = this_C).fit(X_train_temp, y_train_temp)
        y_predict_output = rbf.predict(X_test_temp)
        subplot.plot(X_test_temp, y_predict_output)
        subplot.plot(X_train_temp, y_train_temp, 'o', alpha=0.9, label='train_temp')
        subplot.plot(X_test_temp, y_test_temp, '^', alpha=0.9, label='test_temp')
        subplot.set_title('gamma = {:.2f}, epsilon = {:.2f}, C = {:.2f}'.format(gamma, this_epsilon, this_C))
        subplot.set_xlim([-1, 10])
        subplot.legend()
        plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)




#### Gamma = 10

In [103]:
fig, subaxes = plt.subplots(2, 2, figsize=(15, 8), dpi=50)
xtemp = X['btc_trade_volume'].values.reshape(-1, 1)  # reshape the underlying NumPy array; Series.reshape is not supported
xtemp= sc.fit_transform(xtemp)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(xtemp, y,
                                                   random_state = 10)

for epsilon, this_axis in zip([1,10], subaxes):
    for C, subplot in zip([ 10, 100], this_axis):
        gamma= 10
        rbf1 = SVR(kernel = 'rbf', gamma = gamma, epsilon = epsilon,
                 C = C).fit(X_train_temp, y_train_temp)
        y_predict_output = rbf1.predict(X_test_temp)
        subplot.plot(X_test_temp, y_predict_output)
        subplot.plot(X_train_temp, y_train_temp, 'o', alpha=0.9, label='train_temp')
        subplot.plot(X_test_temp, y_test_temp, '^', alpha=0.9, label='test_temp')
        subplot.set_title('gamma = {:.2f}, epsilon = {:.2f}, C = {:.2f}'.format(gamma, epsilon, C))
        subplot.set_xlim([-1, 10])
        subplot.legend()
        plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)



# Predict Market Price of Bitcoin (Polynomial)
Comparing the test scores of the models above, polynomial regression performs best, with a test score ($R^2$) of 0.99998, so it is the model used for the final prediction.

In [104]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures


### Training Set: Standardization & Data transformation (quadratic form)

In [105]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(mat1)

X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 10)
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.6f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.6f}\n'
     .format(linreg.score(X_test, y_test)))


(poly deg 2) linear model coeff (w):
[-1.01508354e-09  1.01925222e+03  7.35329736e+00  1.21097974e+02
 -7.07657910e+01 -2.24487039e+01  4.79067507e+01  2.07232582e+00
 -1.47822004e+02 -2.49639427e+01  5.61323902e+01  3.73327257e+00
 -2.64992576e+03  5.79069030e+01 -1.62118208e+02 -5.67033063e+01
 -8.83668923e+00  2.66904224e+03 -2.68312692e+01  2.07457968e-02
  7.02929092e+00  4.40559864e+00 -6.35502484e+00 -3.38758372e+00
 -3.26790432e-01 -5.75357277e+00 -1.50476658e+00 -1.74640419e+03
 -8.84496841e+02  2.72847012e+03  1.21876252e+03 -1.60482702e+02
  3.43241638e+03 -4.55659104e+02  1.29128645e+01 -1.21003356e+02
 -2.25104872e+01 -8.80386939e+00  1.01576544e+03 -2.44957547e+01
  1.10434095e+02  1.24225327e+02  5.27561524e+00 -2.78938235e+03
  2.00063207e+01  9.00507173e+00  9.92973316e+00 -1.34953908e+03
  2.80969132e+01 -1.78625212e+00  1.71229663e+02 -4.46361474e+00
 -1.70701042e+03  4.80390214e+02 -1.17164491e+00]
(poly deg 2) linear model intercept (b): 519.480
(poly deg 2) R-squa

### Test Set: Standardization & Data transformation (quadratic form)

In [106]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
mat1 = sc.fit_transform(X)

X_predict= test.loc[:,colnames_selected]
X_predict = sc.transform(X_predict)

X_predict_poly = poly.transform(X_predict)  # reuse the PolynomialFeatures object fitted on the training features
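The test set has to go through exactly the same scaling and polynomial expansion as the training data. One way to guarantee that, sketched below on synthetic data, is to chain the three steps in a single `Pipeline` and call `predict` on the raw new rows. This is an alternative arrangement rather than the method used in this notebook, and the names `X_fit`, `X_new`, and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with an exactly quadratic, noise-free target
rng = np.random.RandomState(0)
X_fit = rng.uniform(-2, 2, size=(150, 2))
y_fit = 1.0 + X_fit[:, 0] ** 2 - 3.0 * X_fit[:, 0] * X_fit[:, 1]
X_new = rng.uniform(-2, 2, size=(5, 2))
y_new = 1.0 + X_new[:, 0] ** 2 - 3.0 * X_new[:, 0] * X_new[:, 1]

# Scaling, quadratic expansion, and regression are fitted together on X_fit
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_fit, y_fit)

# predict() applies the already-fitted scaler and expansion to the new rows
pred_new = model.predict(X_new)
print(np.allclose(pred_new, y_new, atol=1e-6))
```

Because the target here is an exact quadratic, the degree-2 model recovers it on unseen rows, which confirms that the new rows were transformed consistently.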

### Prediction & Visualization


In [107]:
linreg.fit(X_poly,y)
train_score=linreg.score(X_poly,y)
test_predict=linreg.predict(X_predict_poly)
test_predict = pd.DataFrame(test_predict,index=test['Date'],columns=['Prediction'])
print('train_score=',train_score)
test_predict

train_score= 0.9999712253947205


Date,Prediction
2/1/2018 0:00,8810.074889
2/2/2018 0:00,8432.447633
2/3/2018 0:00,8987.916108
2/4/2018 0:00,8135.421173
2/5/2018 0:00,6278.001579
2/6/2018 0:00,6421.209727
2/7/2018 0:00,7859.788307
2/8/2018 0:00,8103.854414
2/9/2018 0:00,8404.879651
2/10/2018 0:00,8305.587978


In [108]:
test_predict.index = pd.to_datetime(test_predict.index) # change index to date/time format
test_predict.plot()
