Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone may be enough. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
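
The target-distribution and metric questions in the checklist above can be answered with a few lines of pandas. The sketch below is only an illustration: `y` and `labels` are synthetic stand-ins (assumptions, not project data). It checks whether a regression target is right-skewed before and after a log transform, and it computes the majority class frequency for a classification target.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed regression target (hypothetical stand-in for a real column).
rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000), name="target")

# Regression: a clearly positive skew suggests a log transform may help.
skew_raw = y.skew()
y_log = np.log1p(y)  # log(1 + y), which is safe when y contains zeros
skew_log = y_log.skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# Classification: the majority class frequency equals the accuracy of
# a baseline that always guesses the most common class.
labels = pd.Series(["no"] * 85 + ["yes"] * 15, name="label")
majority_freq = labels.value_counts(normalize=True).max()
print(f"majority class frequency: {majority_freq:.2f}")
```

With an 85% majority class, as in the synthetic labels here, accuracy alone would look good even for a model that never predicts the minority class.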

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [0]:
# pd.set_option('display.max_columns', None)  # uncomment to show every column of wide frames

In [126]:
# bitcoin dataset from coinmetrics
bitcoin = pd.read_csv('https://raw.githubusercontent.com/coinmetrics-io/data/master/csv/btc.csv')
print(bitcoin.shape)
bitcoin.head()

(4097, 41)


Unnamed: 0,time,AdrActCnt,BlkCnt,BlkSizeByte,BlkSizeMeanByte,CapMVRVCur,CapMrktCurUSD,CapRealUSD,DiffMean,FeeMeanNtv,FeeMeanUSD,FeeMedNtv,FeeMedUSD,FeeTotNtv,FeeTotUSD,HashRate,IssContNtv,IssContPctAnn,IssContUSD,IssTotNtv,IssTotUSD,NVTAdj,NVTAdj90,PriceBTC,PriceUSD,ROI1yr,ROI30d,SplyCur,TxCnt,TxTfrCnt,TxTfrValAdjNtv,TxTfrValAdjUSD,TxTfrValMeanNtv,TxTfrValMeanUSD,TxTfrValMedNtv,TxTfrValMedUSD,TxTfrValNtv,TxTfrValUSD,VtyDayRet180d,VtyDayRet30d,VtyDayRet60d
0,2009-01-03,0.0,0.0,0.0,,,,0.0,,,,,,0.0,,,,,,,,,,1.0,,,,0.0,0.0,0.0,0.0,,,,,,0.0,,,,
1,2009-01-04,0.0,0.0,0.0,,,,0.0,,,,,,0.0,,,,,,,,,,1.0,,,,0.0,0.0,0.0,0.0,,,,,,0.0,,,,
2,2009-01-05,0.0,0.0,0.0,,,,0.0,,,,,,0.0,,,,,,,,,,1.0,,,,0.0,0.0,0.0,0.0,,,,,,0.0,,,,
3,2009-01-06,0.0,0.0,0.0,,,,0.0,,,,,,0.0,,,,,,,,,,1.0,,,,0.0,0.0,0.0,0.0,,,,,,0.0,,,,
4,2009-01-07,0.0,0.0,0.0,,,,0.0,,,,,,0.0,,,,,,,,,,1.0,,,,0.0,0.0,0.0,0.0,,,,,,0.0,,,,


# Clean data and simple EDA


In [0]:
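# Parse the date strings and derive a year column for time-based splitting
# (infer_datetime_format is deprecated in recent pandas and not needed for ISO dates)
bitcoin['time'] = pd.to_datetime(bitcoin['time'])
bitcoin['year'] = bitcoin['time'].dt.year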

In [128]:
bitcoin.isnull().sum()

time                 0
AdrActCnt            0
BlkCnt               0
BlkSizeByte          0
BlkSizeMeanByte      6
CapMVRVCur         561
CapMrktCurUSD      561
CapRealUSD           0
DiffMean             6
FeeMeanNtv         258
FeeMeanUSD         561
FeeMedNtv          258
FeeMedUSD          561
FeeTotNtv            0
FeeTotUSD          561
HashRate             6
IssContNtv           6
IssContPctAnn        6
IssContUSD         561
IssTotNtv            6
IssTotUSD          561
NVTAdj             260
NVTAdj90           650
PriceBTC             0
PriceUSD           561
ROI1yr             926
ROI30d             591
SplyCur              0
TxCnt                0
TxTfrCnt             0
TxTfrValAdjNtv       0
TxTfrValAdjUSD     561
TxTfrValMeanNtv    258
TxTfrValMeanUSD    561
TxTfrValMedNtv     258
TxTfrValMedUSD     561
TxTfrValNtv          0
TxTfrValUSD        561
VtyDayRet180d      741
VtyDayRet30d       591
VtyDayRet60d       621
year                 0
dtype: int64

In [0]:
import plotly.express as px
import seaborn as sns

In [0]:
#px.line(bitcoin, 'time', 'PriceUSD')

In [131]:
bitcoin['year']

0       2009
1       2009
2       2009
3       2009
4       2009
        ... 
4092    2020
4093    2020
4094    2020
4095    2020
4096    2020
Name: year, Length: 4097, dtype: int64

In [132]:
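# Time-based split of the dataset by year
# Note: train covers 2014-2016 and val covers 2018, so 2017 is in none of the subsets
train = bitcoin[(bitcoin['year'] >= 2014) & (bitcoin['year'] < 2017)]
val = bitcoin[bitcoin['year'] == 2018]
test = bitcoin[bitcoin['year'] >= 2019]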
train.shape, val.shape, test.shape

((1096, 42), (365, 42), (447, 42))
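
The filters in the split above select 2014 to 2016 for training and 2018 for validation, which leaves 2017 out of every subset. If a gap-free split is wanted, one option is to derive all three subsets from two cut-off years. The sketch below is a minimal illustration on a synthetic daily frame; `split_by_year`, `val_start`, and `test_start` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Synthetic daily frame standing in for the real data (assumption: one row per day).
df = pd.DataFrame({"time": pd.date_range("2014-01-01", "2019-12-31", freq="D")})
df["year"] = df["time"].dt.year

def split_by_year(frame, val_start, test_start):
    """Return (train, val, test) with train < val_start <= val < test_start <= test."""
    train = frame[frame["year"] < val_start]
    val = frame[(frame["year"] >= val_start) & (frame["year"] < test_start)]
    test = frame[frame["year"] >= test_start]
    return train, val, test

train_s, val_s, test_s = split_by_year(df, val_start=2018, test_start=2019)

# Every row lands in exactly one subset, and the subsets are ordered in time.
assert len(train_s) + len(val_s) + len(test_s) == len(df)
assert train_s["time"].max() < val_s["time"].min() < test_s["time"].min()
print(len(train_s), len(val_s), len(test_s))
```

Because the boundaries are shared between subsets, changing one cut-off year cannot open a gap or create an overlap.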

## GOAL
I want to predict the daily Bitcoin price in USD (`PriceUSD`) from on-chain metrics, and to evaluate how useful those metrics are for a trading strategy.

In [133]:
train.columns

Index(['time', 'AdrActCnt', 'BlkCnt', 'BlkSizeByte', 'BlkSizeMeanByte',
       'CapMVRVCur', 'CapMrktCurUSD', 'CapRealUSD', 'DiffMean', 'FeeMeanNtv',
       'FeeMeanUSD', 'FeeMedNtv', 'FeeMedUSD', 'FeeTotNtv', 'FeeTotUSD',
       'HashRate', 'IssContNtv', 'IssContPctAnn', 'IssContUSD', 'IssTotNtv',
       'IssTotUSD', 'NVTAdj', 'NVTAdj90', 'PriceBTC', 'PriceUSD', 'ROI1yr',
       'ROI30d', 'SplyCur', 'TxCnt', 'TxTfrCnt', 'TxTfrValAdjNtv',
       'TxTfrValAdjUSD', 'TxTfrValMeanNtv', 'TxTfrValMeanUSD',
       'TxTfrValMedNtv', 'TxTfrValMedUSD', 'TxTfrValNtv', 'TxTfrValUSD',
       'VtyDayRet180d', 'VtyDayRet30d', 'VtyDayRet60d', 'year'],
      dtype='object')

In [0]:
target = 'PriceUSD'
features = ['AdrActCnt','BlkSizeByte','BlkSizeMeanByte','TxTfrValMeanUSD','TxTfrValMeanNtv','HashRate', 'CapMVRVCur', 'SplyCur', 'TxCnt', 'TxTfrCnt', 'TxTfrValAdjNtv']

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]
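
The assignment asks whether any features "leak" information about the target. Features denominated in USD, such as `TxTfrValMeanUSD`, are presumably computed with the same day's price, so they may deserve a closer look. One common way to reduce this risk is to lag each feature by a day, so that the row for day t only contains values known on day t-1. The sketch below illustrates the idea on a tiny synthetic frame; the values and the `_lag1` column name are made up for illustration.

```python
import pandas as pd

# Tiny synthetic frame (hypothetical values) with a price and a same-day USD feature.
toy = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=5, freq="D"),
    "PriceUSD": [100.0, 110.0, 105.0, 120.0, 130.0],
    "TxTfrValMeanUSD": [10.0, 11.0, 10.5, 12.0, 13.0],
})

# Shift the feature down by one row so day t only sees the value from day t-1.
toy["TxTfrValMeanUSD_lag1"] = toy["TxTfrValMeanUSD"].shift(1)

# The first row has no previous day, so drop it before modeling.
lagged = toy.dropna(subset=["TxTfrValMeanUSD_lag1"]).reset_index(drop=True)
print(lagged[["time", "PriceUSD", "TxTfrValMeanUSD_lag1"]])
```

This assumes the rows are sorted by date with one row per day. Whether lagging is appropriate depends on when each metric actually becomes available.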

In [0]:
# Random forest regression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error 
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

In [136]:
pip install --upgrade category_encoders

Requirement already up-to-date: category_encoders in /usr/local/lib/python3.6/dist-packages (2.1.0)


In [0]:
import category_encoders as ce

In [138]:
X_train.columns

Index(['AdrActCnt', 'BlkSizeByte', 'BlkSizeMeanByte', 'TxTfrValMeanUSD',
       'TxTfrValMeanNtv', 'HashRate', 'CapMVRVCur', 'SplyCur', 'TxCnt',
       'TxTfrCnt', 'TxTfrValAdjNtv'],
      dtype='object')

In [142]:
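pipeline = make_pipeline(
    # Optional imputation step, e.g. SimpleImputer(strategy='most_frequent'),
    # max_features=1.0 considers every feature at each split (what 'auto' meant for regressors)
    RandomForestRegressor(n_estimators=100, max_depth=3, min_samples_leaf=3, max_features=1.0)
)
pipeline.fit(X_train, y_train)
# For a regressor, .score() returns R^2, not mean absolute error
print('Train R^2:', pipeline.score(X_train, y_train))
print('Validation R^2:', pipeline.score(X_val, y_val))

Train R^2: 0.9711566341892283
Validation R^2: -8.341628349137057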
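
The two numbers printed above come from `pipeline.score`, which for a scikit-learn regressor is the coefficient of determination (R^2), so a value below zero is possible. Mean absolute error has to be computed explicitly with `mean_absolute_error`. The sketch below shows the difference on synthetic data (an assumption, not the Bitcoin data) in which the validation targets lie well above anything seen in training. That may be one contributor to the strongly negative validation score here, since a random forest's predictions are averages of training targets and cannot exceed the largest one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data (assumption): the target grows with x, and the validation
# x values lie beyond the training range.
rng = np.random.default_rng(42)
X_tr = rng.uniform(0, 10, size=(200, 1))
y_tr = 3 * X_tr[:, 0] + rng.normal(0, 1, size=200)
X_va = rng.uniform(20, 30, size=(100, 1))
y_va = 3 * X_va[:, 0] + rng.normal(0, 1, size=100)

model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
pred_va = model.predict(X_va)

# .score() on a regressor is R^2, identical to r2_score on the predictions.
r2_va = model.score(X_va, y_va)
assert np.isclose(r2_va, r2_score(y_va, pred_va))

# MAE is in the target's units and must be computed explicitly.
mae_va = mean_absolute_error(y_va, pred_va)

# Baseline for comparison: always predict the training mean.
mae_baseline = mean_absolute_error(y_va, np.full_like(y_va, y_tr.mean()))

print(f"validation R^2: {r2_va:.2f}")
print(f"validation MAE: {mae_va:.2f} (mean baseline: {mae_baseline:.2f})")
# Forest predictions never exceed the largest training target.
print(pred_va.max() <= y_tr.max())
```

Reporting MAE next to a mean baseline makes it easier to judge whether a model is useful, and it keeps the metric in dollars when the target is a price.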
