## Bitcoin On-Chain Data Exploration

This data was gathered from Glassnode (https://glassnode.com/).

Glassnode is a platform that collects on-chain data from the Bitcoin blockchain. The Bitcoin blockchain is an open ledger, so it can be analyzed using publicly available information such as the amount of Bitcoin being moved versus stored long term, the number of new wallets being created to hold Bitcoin, whether Bitcoin is being moved to trading exchanges, and the price at which each Bitcoin was originally purchased, which shows whether it is currently held at a profit or a loss.

All data was downloaded as CSV files for ease of use during the exploratory analysis, which keeps the workflow simple for learning purposes. Glassnode also offers API access, which would allow daily updates and more current inputs for the price prediction.

Fifteen independent-variable datasets were downloaded, along with one dependent-variable dataset (Price).

**Data to be used: 14 independent variable datasets * roughly 3500 data points = 49000 independent data points**

**After a first look at the datasets, further analysis and cleaning are needed, but the data wrangling is complete for initial training.**



**Key Terms:**

Bitcoin (BTC) - a digital currency that is mined by computers solving complicated math problems, which in turn allows all bitcoin transactions to be verified and placed onto an open, online, immutable, pseudonymous digital ledger called the blockchain. Bitcoin can be subdivided, so values such as 0.01 bitcoin and smaller exist.

Wallets - a location for storing bitcoin. The wallet has a public address which is shared and assigned to bitcoin units through the blockchain. Bitcoin can only be removed from the wallet using a private key which is not stored on the blockchain.

Address - a unique string of 26-35 alphanumeric characters. When bitcoin is sent to an address, that bitcoin is linked to that address. This is the idea behind a wallet: the bitcoin cannot leave that address/wallet until the owner provides the wallet's private key.

Mining/Miners - computers that create bitcoin through solving complicated math problems and verifying transactions are valid with the rest of the public blockchain. Miners get rewarded in bitcoin when they solve the math problem which then allows the miner to add a block of bitcoin transactions to the blockchain, which is the open ledger of all bitcoin transactions.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Price
This is the dependent variable or label that will be used for the machine learning.

In [2]:
# import price data and rough analysis
price = pd.read_csv(r'C:\Users\dagar\Bitcoin_Data_Capstone\price.csv')
print(price.head(), '\n')
print('Number of missing values:\n', price.isna().sum(), '\n')
print(price.shape, '\n')
print(price.describe())
print('\nDate range is:', price.iloc[0,0], 'to',price.iloc[3982,0])

              timestamp     value
0  2010-07-17T00:00:00Z  0.049510
1  2010-07-18T00:00:00Z  0.085840
2  2010-07-19T00:00:00Z  0.080800
3  2010-07-20T00:00:00Z  0.074733
4  2010-07-21T00:00:00Z  0.079210 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3983, 2) 

              value
count   3983.000000
mean    4818.315246
std     9928.110057
min        0.049510
25%       85.049463
50%      581.290525
75%     6736.971907
max    63603.708172

Date range is: 2010-07-17T00:00:00Z to 2021-06-11T00:00:00Z


There are no missing values in the dataset. The date range is from July 17, 2010 to June 11, 2021. There are no values that look suspicious from the data description.
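The same checks (head, missing values, shape, date range) are repeated for every dataset below, each time with a hard-coded last row index. A minimal sketch of a reusable helper is shown here; the function name `summarize` and the in-memory sample are my own illustration, not part of the original notebook, and the helper assumes each file has a `timestamp` column as in the output above.

```python
import io
import pandas as pd

def summarize(df: pd.DataFrame, name: str) -> dict:
    """Collect the basic checks used throughout this notebook (hypothetical helper)."""
    ts = pd.to_datetime(df['timestamp'])
    info = {
        'name': name,
        'rows': len(df),
        'missing': int(df.isna().sum().sum()),
        'start': ts.min().date(),   # first date, no hard-coded row index
        'end': ts.max().date(),     # last date, no hard-coded row index
    }
    print(info)
    return info

# Tiny in-memory stand-in for price.csv, using the first rows printed above
sample_csv = io.StringIO(
    "timestamp,value\n"
    "2010-07-17T00:00:00Z,0.049510\n"
    "2010-07-18T00:00:00Z,0.085840\n"
    "2010-07-19T00:00:00Z,0.080800\n"
)
sample = pd.read_csv(sample_csv)
result = summarize(sample, 'price')
```

Reading the first and last dates from the parsed timestamps means the check keeps working if a file is re-downloaded with more rows.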

## SOPR

The Spent Output Profit Ratio (SOPR) is a daily value computed by dividing the price sold by the price paid for a unit of Bitcoin. When SOPR is relatively high, most units are being sold for significantly more than was originally paid for them. Conversely, when SOPR is low, units of Bitcoin are being sold at a price close to or below the price at which they were previously purchased.

In [3]:
sopr = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/sopr.csv')
print(sopr.head(), '\n')
print('Number of missing values:\n', sopr.isna().sum(), '\n')
print(sopr.shape, '\n')
print(sopr.describe())
print('\nDate range is:', sopr.iloc[0,0], 'to',sopr.iloc[3981,0])

              timestamp     value
0  2010-07-17T00:00:00Z  1.000000
1  2010-07-18T00:00:00Z  1.174760
2  2010-07-19T00:00:00Z  1.318536
3  2010-07-20T00:00:00Z  1.090517
4  2010-07-21T00:00:00Z  1.065532 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3982, 2) 

             value
count  3982.000000
mean      1.010183
std       0.047503
min       0.647491
25%       0.997017
50%       1.003105
75%       1.012456
max       1.896209

Date range is: 2010-07-17T00:00:00Z to 2021-06-10T00:00:00Z


There are no missing values in the SOPR dataset. The date range is July 17, 2010 to June 10, 2021, which is one day shorter than the price data.
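To make the ratio concrete, here is a toy calculation. It assumes SOPR is aggregated over a day as total USD value when sold divided by total USD value when acquired; the column names and numbers are made up for illustration and are not Glassnode data.

```python
import pandas as pd

# Hypothetical coins spent on one day: amount of BTC, the USD price when the
# coins were acquired, and the USD price when they were spent.
spent = pd.DataFrame({
    'btc': [1.0, 2.0, 0.5],
    'price_paid': [100.0, 200.0, 400.0],
    'price_sold': [300.0, 300.0, 300.0],
})

value_sold = (spent['btc'] * spent['price_sold']).sum()  # 300 + 600 + 150 = 1050
value_paid = (spent['btc'] * spent['price_paid']).sum()  # 100 + 400 + 200 = 700
sopr = value_sold / value_paid

print(sopr)  # above 1: coins were sold, on aggregate, at a profit
```

A value of exactly 1 would mean coins were sold, on aggregate, at the price paid, which matches the median of roughly 1.003 in the output above.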

## RHODL Ratio

The Realized "Hold On For Dear Life" (HODL) Ratio, or RHODL Ratio, is a market indicator built from the Realized Cap HODL Waves. The term HODL refers to Bitcoin wallets that hold on to their Bitcoin rather than selling frequently, especially during price declines. The Realized Cap HODL Waves metric expresses the percentage of Bitcoin that has been held in a single wallet for various timespans (>10yr, 7-10yr, 5-7yr, 3-5yr, 2-3yr, 1-2yr, 6-12mo, 3-6mo, 1-3mo, 1w-1m, 1d-1w, 24h). The RHODL Ratio takes the ratio between the 1-week and the 1-2 year Realized Cap HODL Waves and weights it by the total market age to account for the number of new coins entering the market.

A high ratio tends to indicate an overheated market, since it means that many coins that were held for 1-2 years are being sold, which usually implies that 1-2 year investors are taking profits at the expense of new buyers.

In [4]:
rhodl = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/rhodl-ratio.csv')
print(rhodl.head(), '\n')
print('Number of missing values:\n', rhodl.isna().sum(), '\n')
print(rhodl.shape, '\n')
print(rhodl.describe())
print('\nDate range is:', rhodl.iloc[0,0], 'to',rhodl.iloc[3950,0])

              timestamp     value
0  2010-08-17T00:00:00Z  0.202554
1  2010-08-18T00:00:00Z  0.388194
2  2010-08-19T00:00:00Z  0.581370
3  2010-08-20T00:00:00Z  0.735485
4  2010-08-21T00:00:00Z  0.965442 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3951, 2) 

               value
count    3951.000000
mean     8885.391266
std     20411.471636
min         0.202554
25%       525.344392
50%      1853.958238
75%      8449.198447
max    219470.016332

Date range is: 2010-08-17T00:00:00Z to 2021-06-10T00:00:00Z


There are no missing values in the RHODL dataset. The max value is significantly higher than the third quartile value. That, along with the very wide range of values, may indicate that some processing will be needed. The date range is from August 17, 2010 to June 10, 2021, which is shorter than the previous datasets but still large enough to provide insights. This may end up being the date range used.

## Reserve-Risk

Reserve Risk is defined as the price divided by HODL Bank. I am unclear on what constitutes the HODL Bank from the documentation, but I will tentatively assume it is the amount of Bitcoin being held in a wallet >1 year. The Reserve Risk metric can be used to assess confidence of long-term holders relative to the price. When the confidence is high and the price is low, which would be indicated by a low reserve risk (low price/high number of long term holders), then the prospect of purchasing Bitcoin tends to be very profitable.


In [5]:
res_risk = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/reserve-risk.csv')
print(res_risk.head(), '\n')
print('Number of missing values:\n', res_risk.isna().sum(), '\n')
print(res_risk.shape, '\n')
print(res_risk.describe())
print('\nDate range is:', res_risk.iloc[0,0], 'to', res_risk.iloc[3952,0])

              timestamp     value
0  2010-08-15T00:00:00Z  1.132036
1  2010-08-16T00:00:00Z  0.560768
2  2010-08-17T00:00:00Z  0.373186
3  2010-08-18T00:00:00Z  0.294741
4  2010-08-19T00:00:00Z  0.224639 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3953, 2) 

             value
count  3953.000000
mean      0.007465
std       0.023692
min       0.000970
25%       0.001956
50%       0.003420
75%       0.007418
max       1.132036

Date range is: 2010-08-15T00:00:00Z to 2021-06-10T00:00:00Z


There is no missing data in the Reserve Risk dataset. The range of values is quite wide, so the scale of this metric may need to be adjusted. The date range is August 15, 2010 to June 10, 2021, which is shorter than the price data but still a sizeable range for analysis.

## Puell Multiple

The Puell Multiple is calculated by dividing the daily issuance value of bitcoins by the 365-day moving average of the daily issuance value. It reaches high values when the price of bitcoin is well above the average price for the previous 365 days, which can signal the top or the end of a bull market run. On the opposite end, when the Puell Multiple is low, the price of bitcoin is well below the average price for the previous 365 days, which can signal the bottom of a bear market.

In [6]:
puell = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/puell-multiple.csv')
print(puell.head(), '\n')
print('Number of missing values:\n', puell.isna().sum(), '\n')
print(puell.shape, '\n')
print(puell.describe())
print('\nDate range is:', puell.iloc[0,0], 'to',puell.iloc[3617,0])

              timestamp     value
0  2011-07-16T00:00:00Z  3.656330
1  2011-07-17T00:00:00Z  3.830945
2  2011-07-18T00:00:00Z  3.691357
3  2011-07-19T00:00:00Z  3.244558
4  2011-07-20T00:00:00Z  3.366804 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3618, 2) 

             value
count  3618.000000
mean      1.477589
std       1.092461
min       0.283478
25%       0.745061
50%       1.173240
75%       1.862356
max      10.167732

Date range is: 2011-07-16T00:00:00Z to 2021-06-10T00:00:00Z


The Puell Multiple does not have any missing data points. The values look appropriate and have a reasonable spread. The date range is from July 16, 2011 to June 10, 2021, which is the shortest date range so far. This can still be used since it represents 3618 data points and covers about 91% of the price data points.
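The calculation can be sketched with pandas on a synthetic issuance series. This assumes the 365-day window includes the current day; the series name `issuance_usd` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily issuance value in USD: flat at 1M, then doubling on the last day
dates = pd.date_range('2020-01-01', periods=400, freq='D')
issuance_usd = pd.Series(np.full(400, 1_000_000.0), index=dates)
issuance_usd.iloc[-1] = 2_000_000.0

# 365-day moving average; the first 364 days have no full window and are NaN
ma365 = issuance_usd.rolling(window=365).mean()
puell = issuance_usd / ma365

print(puell.dropna().head(1))  # first valid value appears on day 365
print(puell.iloc[-1])          # jumps toward 2 when issuance value doubles
```

Under this assumption the first valid value appears 364 days after the start of the input series, which would be consistent with the Puell data beginning on July 16, 2011 while the price data begins on July 17, 2010.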

## NVT Ratio

The Network Value to Transactions Ratio is the bitcoin market cap divided by the transferred on-chain volume measured in USD. This gives insight into the percentage of bitcoin value that is being moved between parties on a daily basis. This metric gives some insight into the utility of Bitcoin compared to its price in USD.

In [7]:
nvt = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/nvt-ratio.csv')
print(nvt.head(), '\n')
print('Number of missing values:\n', nvt.isna().sum(), '\n')
print(nvt.shape, '\n')
print(nvt.describe())
print('\nDate range is:', nvt.iloc[0,0], 'to', nvt.iloc[3980,0])

              timestamp       value
0  2010-07-18T00:00:00Z   95.861836
1  2010-07-19T00:00:00Z   73.314284
2  2010-07-20T00:00:00Z  115.888962
3  2010-07-21T00:00:00Z  125.705527
4  2010-07-22T00:00:00Z   87.922220 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3981, 2) 

             value
count  3981.000000
mean     24.009796
std      26.190820
min       0.224882
25%      10.391571
50%      18.529059
75%      29.717838
max     448.150102

Date range is: 2010-07-18T00:00:00Z to 2021-06-10T00:00:00Z


There are no missing values for the NVT data. The data value range has some potential outliers and may need to be manipulated. The date range is July 18, 2010 to June 10, 2021, which gives a large number of data points.

## New Addresses

This is the number of unique addresses that appear for the first time on the blockchain in a bitcoin transaction. While a new address does not prove that the buyer is a first-time bitcoin buyer, in aggregate the count tends to show the overall trend of new buyers entering or leaving the market. A higher number of new addresses implies a market with many first-time buyers. A lower number of new addresses implies a market made up mostly of experienced bitcoin buyers.

In [8]:
new_add = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/new-addresses.csv')
print(new_add.head(), '\n')
print('Number of missing values:\n', new_add.isna().sum(), '\n')
print(new_add.shape, '\n')
print(new_add.describe())
print('\nDate range is:', new_add.iloc[0,0], 'to', new_add.iloc[3574,0])

              timestamp  value
0  2011-08-28T00:00:00Z  11561
1  2011-08-29T00:00:00Z  11413
2  2011-08-30T00:00:00Z   9275
3  2011-08-31T00:00:00Z   8974
4  2011-09-01T00:00:00Z   8721 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3575, 2) 

               value
count    3575.000000
mean   234649.834126
std    168169.468753
min      4241.000000
25%     75776.000000
50%    243043.000000
75%    367866.500000
max    800180.000000

Date range is: 2011-08-28T00:00:00Z to 2021-06-10T00:00:00Z


The New Addresses data does not have any missing values. The range of values looks appropriate. The date range is August 28, 2011 to June 10, 2021, which is the shortest date range so far yet still covers about 90% of the price data, so it is large enough for analysis.

## NUPL
Net Unrealized Profit/Loss (NUPL) is the number of bitcoins with a relative unrealized profit minus the number of bitcoins with a relative unrealized loss. A unit of bitcoin has an unrealized profit if the current price minus the previous purchase price is positive. The relative unrealized loss is the same calculation where the result is negative. The data is then normalized against the total cumulative number of bitcoin transactions.

This metric gives insight into how many people are hodling bitcoin that they could sell for a profit. When that number gets high, it could predict a sell-off by those wanting to realize their gains.

In [9]:
nupl = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/nupl.csv')
print(nupl.head(), '\n')
print('Number of missing values:\n', nupl.isna().sum(), '\n')
print(nupl.shape, '\n')
print(nupl.describe())
print('\nDate range is:', nupl.iloc[0,0], 'to', nupl.iloc[3980,0])

              timestamp     value
0  2010-07-18T00:00:00Z  0.421756
1  2010-07-19T00:00:00Z  0.380821
2  2010-07-20T00:00:00Z  0.328775
3  2010-07-21T00:00:00Z  0.365481
4  2010-07-22T00:00:00Z  0.102038 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3981, 2) 

             value
count  3981.000000
mean      0.331661
std       0.318370
min      -1.556947
25%       0.195707
50%       0.404822
75%       0.553495
max       0.877967

Date range is: 2010-07-18T00:00:00Z to 2021-06-10T00:00:00Z


There are no missing values in the NUPL dataset. The range of values looks fairly good though the minimum will have to be investigated more as it seems a bit odd to be smaller than -1.0 for a normalized dataset. The date range is almost equal to the price date range.
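One possible explanation for the minimum below -1.0, offered as an assumption rather than a confirmed reading of this dataset: NUPL is commonly published as (market cap - realized cap) / market cap. In that form the value is capped at 1 on the upside but has no lower bound, because realized cap can exceed market cap by any multiple. A small sketch with made-up numbers:

```python
def nupl(market_cap: float, realized_cap: float) -> float:
    """NUPL in its commonly published form (an assumption for this notebook)."""
    return (market_cap - realized_cap) / market_cap

# Coins are worth more than was paid for them: positive, below 1
print(nupl(100.0, 40.0))    # 0.6

# Realized cap is 2.5x market cap: the value drops below -1
print(nupl(100.0, 250.0))   # -1.5
```

If that definition applies here, values below -1.0 would correspond to days when realized cap was more than twice the market cap, which is worth checking against the early part of the series.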

## MVRV Ratio

Market Value to Realized Value (MVRV) is the ratio between market cap and realized cap. Market cap is calculated by multiplying the latest Bitcoin price by the total number of bitcoins mined thus far. The realized cap is calculated by valuing each unit of bitcoin at the market price at which it was last transacted. When the MVRV value is high, the implication is that bitcoin is overvalued and the market cap is significantly higher than the value of the coins based on their purchase price. Conversely, when the MVRV value is low, the implication is that bitcoin is undervalued.

This can express the sentiment of speculators vs holders or high time preference vs low time preference. 


In [10]:
mvrv_ratio = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/mvrv-ratio.csv')
print(mvrv_ratio.head(), '\n')
print('Number of missing values:\n', mvrv_ratio.isna().sum(), '\n')
print(mvrv_ratio.shape, '\n')
print(mvrv_ratio.describe())
print('\nDate range is:', mvrv_ratio.iloc[0,0], 'to', mvrv_ratio.iloc[3981,0])

              timestamp     value
0  2010-07-17T00:00:00Z  1.000000
1  2010-07-18T00:00:00Z  1.300600
2  2010-07-19T00:00:00Z  1.814750
3  2010-07-20T00:00:00Z  1.586253
4  2010-07-21T00:00:00Z  1.399411 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3982, 2) 

             value
count  3982.000000
mean      1.842166
std       0.892532
min       0.399227
25%       1.248235
50%       1.674141
75%       2.233915
max       7.120453

Date range is: 2010-07-17T00:00:00Z to 2021-06-10T00:00:00Z


There is no missing data from the MVRV dataset. The description of the data looks like an appropriate range of data. The date range is from July 17, 2010 to June 10, 2021 which includes almost the entire date range for price.

## MVRV Z-Score

The MVRV Z-Score is the difference between market cap and realized cap divided by the standard deviation of market cap: (market cap - realized cap) / std(market cap).

This can indicate a market top when the market value is significantly higher than the realized value, whereas when the market value is significantly lower than the realized value it can indicate a market bottom.

In [11]:
mvrv_z = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/mvrv-z-score.csv')
print(mvrv_z.head(), '\n')
print('Number of missing values:\n', mvrv_z.isna().sum(), '\n')
print(mvrv_z.shape, '\n')
print(mvrv_z.describe())
print('\nDate range is:', mvrv_z.iloc[0,0], 'to', mvrv_z.iloc[3980,0])

              timestamp     value
0  2010-07-18T00:00:00Z  2.518053
1  2010-07-19T00:00:00Z  2.097027
2  2010-07-20T00:00:00Z  1.651545
3  2010-07-21T00:00:00Z  1.927019
4  2010-07-22T00:00:00Z  0.885502 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3981, 2) 

             value
count  3981.000000
mean      1.698312
std       1.911428
min      -0.830057
25%       0.459079
50%       1.208088
75%       2.320322
max      12.547296

Date range is: 2010-07-18T00:00:00Z to 2021-06-10T00:00:00Z


There is no missing data from the MVRV Z-Score dataset. The range of data looks appropriate for this metric. The date range is from July 18, 2010 to June 10, 2021. This range is almost the same amount as the price dataset.
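The formula (market cap - realized cap) / std(market cap) can be sketched as follows. Whether the standard deviation is taken over all history up to each day (as done here with an expanding window) or over the full sample is an assumption on my part, and the numbers are invented.

```python
import pandas as pd

# Made-up market cap and realized cap values for five consecutive days
caps = pd.DataFrame({
    'market_cap': [100.0, 120.0, 90.0, 200.0, 150.0],
    'realized_cap': [80.0, 85.0, 88.0, 100.0, 110.0],
})

# Standard deviation of market cap using only data available up to each day
std_to_date = caps['market_cap'].expanding(min_periods=2).std()

mvrv_z = (caps['market_cap'] - caps['realized_cap']) / std_to_date
print(mvrv_z)
```

Using an expanding window avoids look-ahead: each day's score depends only on market cap values that were known on that day, which matters if the metric is later used as a model feature.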

## Hash Ribbon

The Hash Ribbon calculates the spread between the 30-day Moving Average (30dMA) and the 60-day Moving Average (60dMA) of the hash rate of bitcoin miners. When the 30dMA goes above the 60dMA, this usually indicates a change in price momentum from negative to positive. The crossover reflects more bitcoin mining activity over the near term (30 days) than over the longer term (60 days). Bitcoin miners will mine more when they expect more profit from mining, and that is what this market indicator tries to express.

In [12]:
hash_ribbon = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/hash-ribbon.csv')
print(hash_ribbon.head(), '\n')
print('Number of missing values:\n', hash_ribbon.isna().sum(), '\n')
print(hash_ribbon.shape, '\n')
print(hash_ribbon.describe())
print('\nDate range is:', hash_ribbon.iloc[0,0], 'to', hash_ribbon.iloc[4475,0])

              timestamp  buy  capitulation  crossed          ma30  \
0  2009-03-10T00:00:00Z    0             1        0  5.722585e+06   
1  2009-03-11T00:00:00Z    0             1        0  5.688013e+06   
2  2009-03-12T00:00:00Z    0             1        0  5.685052e+06   
3  2009-03-13T00:00:00Z    0             1        0  5.651161e+06   
4  2009-03-14T00:00:00Z    0             1        0  5.635171e+06   

           ma60  
0  5.773043e+06  
1  5.838596e+06  
2  5.858933e+06  
3  5.874558e+06  
4  5.864596e+06   

Number of missing values:
 timestamp       0
buy             0
capitulation    0
crossed         0
ma30            0
ma60            0
dtype: int64 

(4476, 6) 

               buy  capitulation      crossed          ma30          ma60
count  4476.000000   4476.000000  4476.000000  4.476000e+03  4.476000e+03
mean      0.030831      0.148794     0.010724  2.395988e+19  2.342158e+19
std       0.172879      0.355925     0.103011  4.388612e+19  4.314600e+19
min       0.00000

This dataset is significantly different from the previous datasets. The three indicator columns flag whether it is a good time to buy (buy), whether miners are capitulating (capitulation), and whether the ma30 has crossed above the ma60 (crossed). This data will need to be processed. As of now, I am thinking of creating one column that holds the difference between the 30dMA and 60dMA of the hash rate. When that difference is positive, the 30dMA is above the 60dMA, which will be the indicator.

The date range is from March 10, 2009 to June 10, 2021. This is the longest date range and will need to be trimmed to match the other datasets.
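A minimal sketch of that single-column idea, using a small made-up frame with the same `ma30` and `ma60` column names as the file. The new column names (`ma_spread`, `ma_spread_rel`, `crossed_up`) are my own; the relative spread is included because the hash rate grows by many orders of magnitude over the date range, so raw differences from 2009 and 2021 are not comparable.

```python
import pandas as pd

# Made-up moving averages; only the column names match the real file
hash_ribbon = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-01-01', '2021-01-02',
                                 '2021-01-03', '2021-01-04']),
    'ma30': [95.0, 99.0, 101.0, 104.0],
    'ma60': [100.0, 100.0, 100.0, 100.0],
})

# Positive spread means the 30dMA is above the 60dMA
hash_ribbon['ma_spread'] = hash_ribbon['ma30'] - hash_ribbon['ma60']

# Same spread expressed as a fraction of the 60dMA, to remove the scale
hash_ribbon['ma_spread_rel'] = hash_ribbon['ma_spread'] / hash_ribbon['ma60']

# True only on the day the 30dMA moves from below to above the 60dMA
above = hash_ribbon['ma_spread'] > 0
hash_ribbon['crossed_up'] = above & ~above.shift(1, fill_value=False)

print(hash_ribbon)
```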

## Exchange Net Flow Volume

The Exchange Net Flow Volume is the difference between the volume of bitcoin flowing into trading exchanges and the volume of bitcoin flowing out of trading exchanges. This metric can give insight into whether a large sell-off of bitcoin is about to happen, since the coins must be moved from storage wallets to exchanges prior to selling. In addition, new bitcoin users tend to keep more coins on the exchange from which they purchased them, out of convenience and inexperience. More experienced buyers transfer their purchased coins into storage wallets that are more secure but a bit more complicated to use.

This dataset has a weakness: blockchain analytics alone cannot confirm that all exchange addresses are correctly accounted for. Exchanges can add new addresses, and the data collection needs to label each new address properly, which requires extra steps done outside of the blockchain.

In [13]:
enfv = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/exchange-net-flow-volume.csv')
print(enfv.head(), '\n')
print('Number of missing values:\n', enfv.isna().sum(), '\n')
print(enfv.shape, '\n')
print(enfv.describe())
print('\nDate range is:', enfv.iloc[0,0], 'to', enfv.iloc[3573,0])

              timestamp     value
0  2011-08-29T00:00:00Z  8.456522
1  2011-08-30T00:00:00Z  0.000000
2  2011-08-31T00:00:00Z  3.010000
3  2011-09-01T00:00:00Z  5.000000
4  2011-09-02T00:00:00Z  0.090000 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3574, 2) 

              value
count   3574.000000
mean     725.152970
std     7531.291254
min   -69789.449136
25%    -1742.075173
50%      232.011620
75%     2808.306393
max    96773.475258

Date range is: 2011-08-29T00:00:00Z to 2021-06-10T00:00:00Z


The range of values for Exchange Net Flow Volume looks reasonable. The date range is from August 29, 2011 to June 10, 2021, which is the shortest date range so far. The ENFV data covers about 90% of the price data, so it is large enough for analysis.

## Difficulty Ribbon

The Difficulty Ribbon consists of the 200, 128, 90, 60, 40, 25, and 14-day moving averages of the mining difficulty (the downloaded file also includes a 9-day average). The mining difficulty increases as mining computers gain processing speed and as more mining computers come online to mine on the blockchain. When the multi-day moving averages of mining difficulty consolidate towards the same value, that usually indicates a good time to buy bitcoin, and a price increase is likely to follow.

In [14]:
diff_ribbon = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/difficulty-ribbon.csv')
print(diff_ribbon.head(1), '\n')
print('Number of missing values:\n', diff_ribbon.isna().sum(), '\n')
print(diff_ribbon.shape, '\n')
print(diff_ribbon.describe())
print('\nDate range is:', diff_ribbon.iloc[0,0], 'to', diff_ribbon.iloc[1471,0])

              timestamp                   ma128                    ma14  \
0  2017-05-31T00:00:00Z  2090873105297871500000  2504319261575269000000   

                    ma200                    ma25                    ma40  \
0  1809251068213638500000  2441060540318510000000  2365735778685950000000   

                     ma60                     ma9                    ma90  
0  2306104330445788800000  2559465144851490000000  2211896313063676000000   

Number of missing values:
 timestamp    0
ma128        0
ma14         0
ma200        0
ma25         0
ma40         0
ma60         0
ma9          0
ma90         0
dtype: int64 

(1472, 9) 

                   timestamp                    ma128  \
count                   1472                     1472   
unique                  1472                     1472   
top     2018-01-11T00:00:00Z  79356108812926400000000   
freq                       1                        1   

                           ma14                    ma200  \
count

There is no missing data in this dataset. The data needs to be processed to express the idea behind the metric. The most likely processing will be to take the differences between the moving averages, so that a small difference means they have consolidated and a larger difference means they are further apart.

A concern here is the date range. It is much shorter than for the other metrics and covers only 37% of the price data, so this will have to be considered.
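One way to turn the ribbon into a single consolidation measure is sketched below. The `describe()` output above reports `unique`/`top`/`freq` for the moving-average columns, which suggests they were not read as a numeric dtype (the values are too large for a 64-bit integer), so the sketch converts them to float first. The first row uses values printed above; the second row and the `ribbon_width` column name are invented for illustration.

```python
import pandas as pd

# Stand-in for difficulty-ribbon.csv with the values held as text
diff_ribbon = pd.DataFrame({
    'timestamp': ['2017-05-31T00:00:00Z', '2017-06-01T00:00:00Z'],
    'ma14': ['2504319261575269000000', '2400000000000000000000'],
    'ma60': ['2306104330445788800000', '2390000000000000000000'],
    'ma200': ['1809251068213638500000', '2380000000000000000000'],
})

# Convert every moving-average column to float so arithmetic works
ma_cols = [c for c in diff_ribbon.columns if c.startswith('ma')]
ribbon = diff_ribbon[ma_cols].astype(float)

# Width of the ribbon relative to its mean: small values mean the
# moving averages have consolidated toward the same level
diff_ribbon['ribbon_width'] = (
    (ribbon.max(axis=1) - ribbon.min(axis=1)) / ribbon.mean(axis=1)
)

print(diff_ribbon[['timestamp', 'ribbon_width']])
```

Dividing by the row mean keeps the measure comparable across years, since difficulty itself grows over time.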

## CVDD

Cumulative Value-Days Destroyed (CVDD) is the ratio of the cumulative USD value of Coin Days Destroyed to the market age in days. Coin Days Destroyed (CDD) is calculated by taking the number of coins in a transaction and multiplying it by the number of days since those coins were last spent. The CDD metric measures bitcoin transaction volume but gives more weight to coins that have not been spent in a long time. A consistently high CDD value can be a sign that the market is turning bearish. The CVDD assigns a USD value to the CDD and then divides by the market age in days, which gives insight into the value released by the CDD. The CVDD has traditionally been a floor for the price during bear markets and can help indicate a bear market bottom or consolidation phase.

In [20]:
cvdd = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/cvdd.csv')
print(cvdd.head(1), '\n')
print('Number of missing values:\n', cvdd.isna().sum(), '\n')
print(cvdd.shape, '\n')
print(cvdd.describe())
print('\nDate range is:', cvdd.iloc[0,0], 'to', cvdd.iloc[3981,0])

              timestamp         value
0  2010-07-17T00:00:00Z  9.583721e-09 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(3982, 2) 

              value
count  3.982000e+03
mean   1.344208e+03
std    2.007571e+03
min    9.583721e-09
25%    6.929988e+00
50%    1.694949e+02
75%    2.657301e+03
max    9.546828e+03

Date range is: 2010-07-17T00:00:00Z to 2021-06-10T00:00:00Z


The CVDD dataset has no missing values. The values have a wide spread, and the analysis may be better if the earliest data is ignored, since those values are very small and look like outliers. The date range is from July 17, 2010 to June 10, 2021. This represents most of the price date range and will likely be shortened to accommodate the date limitations of other metrics.

## CYD
Coin Years Destroyed is a 365 day rolling sum of the Coin Days Destroyed (CDD) described earlier. This gives insight into the amount of old coins traded in the market place. This can be seen as an indicator of long-term holder behaviour. When CYD is high, then more long-term investors are selling their coins. When CYD is low, long-term investors are holding onto their coins and most transactions are from recent investors.

In [22]:
cyd = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/cyd.csv')
print(cyd.head(1), '\n')
print('Number of missing values:\n', cyd.isna().sum(), '\n')
print(cyd.shape, '\n')
print(cyd.describe())
print('\nDate range is:', cyd.iloc[0,0], 'to', cyd.iloc[4168,0])

              timestamp         value
0  2010-01-11T00:00:00Z  3.461755e+06 

Number of missing values:
 timestamp    0
value        0
dtype: int64 

(4169, 2) 

              value
count  4.169000e+03
mean   2.521463e+09
std    1.578675e+09
min    3.461755e+06
25%    1.281882e+09
50%    2.459569e+09
75%    3.487977e+09
max    6.728864e+09

Date range is: 2010-01-11T00:00:00Z to 2021-06-10T00:00:00Z


There are no missing values in the CYD dataset. The values all fall within a reasonable range, and removing the early data points will likely make the data even more uniform. The date range covers most of the price date range, though it will likely be narrowed to accommodate other metrics.
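The relationship between Coin Days Destroyed and Coin Years Destroyed described above can be sketched with toy numbers. The function name and the synthetic daily series are illustrative only, and the window is assumed to include the current day.

```python
import numpy as np
import pandas as pd

def coin_days_destroyed(btc: float, days_since_last_spent: float) -> float:
    """CDD for one transaction: coins moved times days they sat unspent."""
    return btc * days_since_last_spent

# 2 BTC that had not moved for 100 days destroy 200 coin days when spent
print(coin_days_destroyed(2.0, 100))

# Synthetic daily CDD totals: exactly one coin day destroyed per day
dates = pd.date_range('2019-01-01', periods=400, freq='D')
cdd = pd.Series(np.ones(400), index=dates)

# CYD as a 365-day rolling sum of daily CDD
cyd = cdd.rolling(window=365).sum()
print(cyd.dropna().iloc[0])
```

Because the rolling sum needs a full 365-day window, the first 364 days of any CYD series built this way are undefined.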

## Address Balances

This metric measures the number of wallet addresses holding bitcoin. The addresses are divided into categories based on the amount of bitcoin held in the wallet (0.01 BTC up to 10,000 BTC). This metric is a measure of buying euphoria or mass adoption. If the number of small wallets relative to large wallets gets large, then that means more "Main Street" buyers are entering the market place, which indicates either euphoric speculation (get rich quick buyers) or mass adoption of bitcoin as a store of wealth. As well, when the number of the largest wallets begin to decrease, that indicates the large investors are taking profits and off-loading some bitcoin. Typically that leads to a market top for bitcoin price.

In [29]:
address_above_100 = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-100.csv')
address_above_10k = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-10k.csv')
address_above_10 = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-10.csv')
address_above_1k = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-1k.csv')
address_above_1 = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-1.csv')
address_above_0_1 = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-0-1.csv')
address_above_0_01 = pd.read_csv(r'C:/Users/dagar/Bitcoin_Data_Capstone/addresses-above-0-01.csv')

# concatenate the different dataframes into one dataframe
col_names = ['timestamp','>0.01', '>0.1', '>1', '>10', '>100', '>1k', '>10k']
address_data = pd.concat([address_above_0_01, address_above_0_1.iloc[: , 1], address_above_1.iloc[:,1], 
                          address_above_10.iloc[: , 1], address_above_100.iloc[:, 1], address_above_1k.iloc[:, 1],
                          address_above_10k.iloc[:, 1]], axis=1)

address_data.columns = col_names
print(address_data.head(5), '\n')
print('Number of missing values:\n', address_data.isna().sum(), '\n')
print(address_data.shape, '\n')
print(address_data.describe())
print('\nDate range is:', address_data.iloc[0,0], 'to', address_data.iloc[3574,0])

              timestamp   >0.01    >0.1      >1    >10  >100  >1k  >10k
0  2011-08-28T00:00:00Z  212838  135241  104760  65766  4233  641    81
1  2011-08-29T00:00:00Z  213432  135626  105052  65920  4259  640    80
2  2011-08-30T00:00:00Z  213985  136035  105391  66054  4271  640    80
3  2011-08-31T00:00:00Z  214506  136380  105571  66192  4296  640    80
4  2011-09-01T00:00:00Z  214595  136253  105366  66160  4325  640    80 

Number of missing values:
 timestamp    0
>0.01        0
>0.1         0
>1           0
>10          0
>100         0
>1k          0
>10k         0
dtype: int64 

(3575, 8) 

              >0.01          >0.1             >1            >10          >100  \
count  3.575000e+03  3.575000e+03    3575.000000    3575.000000   3575.000000   
mean   3.788288e+06  1.501435e+06  493099.252867  127430.827692  14884.260420   
std    3.064763e+06  1.072602e+06  244165.451847   28008.653875   3368.855658   
min    2.128380e+05  1.352410e+05  104760.000000   65766.000000   42

There are no missing values in the Address dataset. The range of data has no suspicious values. The date range for the data is from August 28, 2011 to June 10, 2021, which represents about 90% of the price data. This can reliably be used for analysis.
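The `pd.concat(..., axis=1)` call above aligns rows by index, which here means by row position, so it is only correct if every address file has the same dates in the same order. A more defensive sketch joins on `timestamp` instead. The two small frames below are made up (one is deliberately a row shorter), and `small_to_large` is a hypothetical feature name for the small-versus-large wallet comparison described in this section.

```python
from functools import reduce
import pandas as pd

# Made-up stand-ins for two of the address files
above_1 = pd.DataFrame({
    'timestamp': ['2011-08-28', '2011-08-29', '2011-08-30'],
    'value': [104760, 105052, 105391],
})
above_1k = pd.DataFrame({
    'timestamp': ['2011-08-29', '2011-08-30'],  # first day missing on purpose
    'value': [640, 640],
})

# Rename each 'value' column to its threshold label, then join on timestamp
frames = {'>1': above_1, '>1k': above_1k}
renamed = [df.rename(columns={'value': label}) for label, df in frames.items()]
address_data = reduce(
    lambda left, right: left.merge(right, on='timestamp', how='inner'),
    renamed,
)

# The thresholds are cumulative ('>1' also counts the '>1k' addresses),
# so this is a ratio of cumulative counts, not of disjoint buckets
address_data['small_to_large'] = address_data['>1'] / address_data['>1k']
print(address_data)
```

With an inner join, a date missing from any one file is dropped rather than silently paired with the wrong row.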

## Summary

August 29, 2011 to June 10, 2021 will be the date range, since that is the shortest range of data that still covers about 90% of the price data points. This date range comes from the ENFV dataset.
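The common date range and the coverage figure can be computed rather than read off by eye. The sketch below uses short synthetic frames in place of the real files; `make_daily` is a hypothetical helper and the dataset names are only labels.

```python
import pandas as pd

def make_daily(start: str, periods: int) -> pd.DataFrame:
    """Build a synthetic daily dataset (illustration only)."""
    return pd.DataFrame({
        'timestamp': pd.date_range(start, periods=periods, freq='D', tz='UTC'),
        'value': range(periods),
    })

price = make_daily('2010-07-17', 10)
datasets = {
    'sopr': make_daily('2010-07-17', 9),
    'enfv': make_daily('2010-07-20', 6),
}

# Common window: latest start date and earliest end date across all datasets
start = max(df['timestamp'].min() for df in datasets.values())
end = min(df['timestamp'].max() for df in datasets.values())

# Share of price rows that fall inside the common window
in_range = price['timestamp'].between(start, end)
coverage = in_range.mean()

print(start.date(), 'to', end.date(), f'covers {coverage:.0%} of price rows')
```

Applying the same `between(start, end)` mask to each dataset trims them all to the shared window before they are joined.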

The Difficulty Ribbon data will not be used since the date range is too limited and would only cover 37% of the price data.

The datasets are not missing any values. Further analysis is still needed to decide how the values should be processed. Some may need to be normalized, outliers may need to be adjusted for, and some datasets need to be converted from multiple columns to a single column.

**Further analysis and cleaning needed, but the data wrangling is complete for initial training.**

**Data to be used: 14 independent variable datasets * roughly 3500 data points = 49000 independent data points**