# Cleaning Cryptocurrency data

In this notebook, we merge the CSV files explored in the previous notebook as cleanly as possible. As we saw, all of them except `bitcoin_dataset.csv` and `ethereum_dataset.csv` share the same columns:

* Date: date of observation
* Open: Opening price on the given day
* High: Highest price on the given day
* Low: Lowest price on the given day
* Close: Closing price on the given day
* Volume: Volume of transactions on the given day
* Market Cap: Market capitalization in USD

In [87]:
from pathlib import Path
from subprocess import check_output

import pandas as pd
import glob

In [88]:
DATA_PATH = Path('../../data/raw/cryptocurrencypricehistory')
PROCESSED_DATA_PATH = Path('../../data/processed/cryptocurrencypricehistory')

PROCESSED_DATA_PATH.mkdir(exist_ok=True, parents=True)

Next, we list the files that need to be cleaned.

In [89]:
print(check_output(["ls", DATA_PATH]).decode("utf8"))

bitcoin_cash_price.csv
bitcoin_dataset.csv
bitcoin_price.csv
bitconnect_price.csv
dash_price.csv
ethereum_classic_price.csv
ethereum_dataset.csv
ethereum_price.csv
iota_price.csv
litecoin_price.csv
monero_price.csv
nem_price.csv
neo_price.csv
numeraire_price.csv
omisego_price.csv
qtum_price.csv
ripple_price.csv
stratis_price.csv
waves_price.csv



For instance, if we open one of the files, we can see the attributes described above.

In [90]:
df_0 = pd.read_csv(DATA_PATH / 'bitcoin_cash_price.csv', index_col='Date')

df_0.head()

Date,Open,High,Low,Close,Volume,Market Cap
"Sep 05, 2017",514.9,550.95,458.78,541.71,338978000,8527100000
"Sep 04, 2017",608.26,608.26,500.75,517.24,328957000,10072200000
"Sep 03, 2017",578.27,617.41,563.59,607.43,344862000,9574520000
"Sep 02, 2017",621.96,642.05,560.58,575.9,350478000,10297000000
"Sep 01, 2017",588.4,645.52,586.73,622.17,393839000,9740460000


From this table we will only use the `Close` value.

Now we are going to use `pd.concat` to join the `Close` column of each currency side by side, aligned by `Date`. We start with the file from the previous example and then add every other file that has the same columns.

Each column is then renamed after its cryptocurrency so that the columns can be told apart.
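
One way to derive a column name from a file path is the `stem` attribute from `pathlib`. The sketch below is only an illustration; the helper name `column_name` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def column_name(filename):
    """Return the file name without its directory or extension."""
    return Path(filename).stem


example = '../../data/raw/cryptocurrencypricehistory/dash_price.csv'
print(column_name(example))
```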

In [91]:
col_list = ["Date", "Close"]

path = r'../../data/raw/cryptocurrencypricehistory'
files = glob.glob(path + "/*.csv")

# These two files have different columns and are handled separately below.
excluded = {'bitcoin_dataset.csv', 'ethereum_dataset.csv'}

# Start from the Bitcoin Cash close prices, with the column named after the file.
df_cc = pd.read_csv(DATA_PATH / 'bitcoin_cash_price.csv', index_col='Date', usecols=col_list)
df_cc.rename(columns={'Close': 'bitcoin_cash_price'}, inplace=True)

# Note: the glob also matches bitcoin_cash_price.csv, so that column appears twice.
for filename in files:
    if Path(filename).name not in excluded:
        # Name the Close column after the file, e.g. 'dash_price'.
        df = pd.read_csv(filename, index_col='Date', usecols=col_list)
        df.rename(columns={'Close': Path(filename).stem}, inplace=True)
        # Align on the Date index and append as a new column.
        df_cc = pd.concat([df_cc, df], axis=1)

df_cc.head()

Date,bitcoin_cash_price,bitconnect_price,iota_price,ripple_price,qtum_price,dash_price,neo_price,monero_price,numeraire_price,bitcoin_price,stratis_price,bitcoin_cash_price.1,waves_price,litecoin_price,ethereum_price,nem_price,omisego_price,ethereum_classic_price
"Sep 05, 2017",541.71,129.42,0.613085,0.215189,11.71,327.23,22.8,118.82,21.53,4376.53,6.03,541.71,4.98,71.29,312.99,0.286227,10.98,16.58
"Sep 04, 2017",517.24,114.13,0.566472,0.204968,10.98,316.13,21.83,106.17,20.74,4236.31,5.77,517.24,4.66,65.21,295.17,0.27322,8.78,15.84
"Sep 03, 2017",607.43,130.99,0.743968,0.228811,15.29,356.39,30.32,126.01,26.9,4582.96,6.59,607.43,5.2,76.84,347.48,0.307264,10.57,18.57
"Sep 02, 2017",575.9,131.33,0.695547,0.226669,16.39,350.17,31.72,124.8,27.24,4578.77,6.34,575.9,5.23,79.02,348.98,0.295884,10.8,20.08
"Sep 01, 2017",622.17,140.97,0.807778,0.248479,18.26,393.35,32.01,141.2,32.45,4892.01,7.25,622.17,5.74,86.04,387.74,0.33231,11.97,21.94


We now have a table with the closing price of every cryptocurrency, excluding `bitcoin_dataset.csv` and `ethereum_dataset.csv`.
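
Because the frames are concatenated along `axis=1`, pandas aligns them on the `Date` index and keeps every date found in any file, so a currency with no quote on a given day gets `NaN` there. A minimal sketch with made-up prices (the names `coin_a` and `coin_b` are hypothetical) illustrates this:

```python
import pandas as pd

# Two toy frames with made-up prices; coin_b has no value for the earliest date.
coin_a = pd.DataFrame(
    {'coin_a': [1.0, 2.0, 3.0]},
    index=pd.Index(['Sep 03, 2017', 'Sep 02, 2017', 'Sep 01, 2017'], name='Date'),
)
coin_b = pd.DataFrame(
    {'coin_b': [10.0, 20.0]},
    index=pd.Index(['Sep 03, 2017', 'Sep 02, 2017'], name='Date'),
)

# axis=1 places the frames side by side; the default join='outer' keeps all dates.
toy = pd.concat([coin_a, coin_b], axis=1)
print(toy)
```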

Next, we check whether any rows contain NaN values.

In [92]:
print('Rows containing NaN:', df_cc.isna().any(axis=1).sum())
df_cc.isna().any(axis=0)

Rows containing NaN: 1562


bitcoin_cash_price         True
bitconnect_price           True
iota_price                 True
ripple_price               True
qtum_price                 True
dash_price                 True
neo_price                  True
monero_price               True
numeraire_price            True
bitcoin_price             False
stratis_price              True
bitcoin_cash_price         True
waves_price                True
litecoin_price            False
ethereum_price             True
nem_price                  True
omisego_price              True
ethereum_classic_price     True
dtype: bool

We repeat the same check for `bitcoin_dataset.csv` and `ethereum_dataset.csv`.
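
Since the same check is applied to several frames, it could be wrapped in a small helper. The function `nan_report` below is a hypothetical convenience, not part of the original notebook, shown here on a toy frame with made-up values:

```python
import pandas as pd


def nan_report(df):
    """Return the number of rows with any NaN and the per-column NaN flags."""
    rows_with_nan = int(df.isna().any(axis=1).sum())
    columns_with_nan = df.isna().any(axis=0)
    return rows_with_nan, columns_with_nan


toy = pd.DataFrame({'a': [1.0, float('nan'), 3.0], 'b': [1.0, 2.0, 3.0]})
rows, cols = nan_report(toy)
print('Rows containing NaN:', rows)
print(cols)
```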

In [93]:
df_bit = pd.read_csv(DATA_PATH / 'bitcoin_dataset.csv', index_col='Date')
print('--- Bitcoin Dataset ---')
print('Rows containing NaN:', df_bit.isna().any(axis=1).sum())
df_bit.isna().any(axis=0)

--- Bitcoin Dataset ---
Rows containing NaN: 478


btc_market_price                                       False
btc_total_bitcoins                                     False
btc_market_cap                                         False
btc_trade_volume                                        True
btc_blocks_size                                        False
btc_avg_block_size                                     False
btc_n_orphaned_blocks                                  False
btc_n_transactions_per_block                           False
btc_median_confirmation_time                           False
btc_hash_rate                                          False
btc_difficulty                                         False
btc_miners_revenue                                     False
btc_transaction_fees                                   False
btc_cost_per_transaction_percent                       False
btc_cost_per_transaction                               False
btc_n_unique_addresses                                 False
btc_n_transactions      

In [94]:
df_eth = pd.read_csv(DATA_PATH / 'ethereum_dataset.csv', index_col='Date(UTC)')
print('--- Ethereum Dataset ---')
print('Rows containing NaN:', df_eth.isna().any(axis=1).sum())
df_eth.isna().any(axis=0)

--- Ethereum Dataset ---
Rows containing NaN: 769


UnixTimeStamp        False
eth_etherprice       False
eth_tx               False
eth_address          False
eth_supply           False
eth_marketcap        False
eth_hashrate         False
eth_difficulty       False
eth_blocks           False
eth_uncles           False
eth_blocksize        False
eth_blocktime        False
eth_gasprice         False
eth_gaslimit         False
eth_gasused          False
eth_ethersupply      False
eth_chaindatasize     True
eth_ens_register      True
dtype: bool

As we can see, some columns contain NaN values, which must be handled to clean the data as well as possible. There are two options for dealing with them:

* Fill each gap with a fixed value
* Remove the row for that day

Deleting a row is not a good solution for the close values, because other currencies may have a value for that day. We therefore keep the row and fill each missing value with 0.
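
The trade-off between the two options can be seen on a small made-up frame (the column names `coin_a` and `coin_b` are hypothetical). Dropping removes every date on which any currency is missing, while filling keeps all dates:

```python
import pandas as pd

toy = pd.DataFrame(
    {'coin_a': [1.0, 2.0, 3.0], 'coin_b': [10.0, float('nan'), float('nan')]},
    index=pd.Index(['Sep 03, 2017', 'Sep 02, 2017', 'Sep 01, 2017'], name='Date'),
)

dropped = toy.dropna()   # remove every row that contains a NaN
filled = toy.fillna(0.)  # replace each NaN with a fixed value

print(len(dropped), 'row(s) left after dropna')
print(len(filled), 'row(s) left after fillna')
```

After filling, a 0 in the table stands for "no data for that day" rather than an actual price.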

In [95]:
df_cc.fillna(0., inplace=True)

For `bitcoin_dataset.csv` and `ethereum_dataset.csv` we likewise fill the NaN values instead of dropping rows. As we saw above, the first has 478 rows containing NaN, all in a single column; removing them would leave too little data for later analysis. The second has 769 rows containing NaN, and we keep those rows as well because we prefer to retain the data.

In [96]:
df_bit.fillna(0., inplace=True)
df_eth.fillna(0., inplace=True)

Finally, we save each DataFrame to a new CSV file in the processed data directory defined earlier.

In [97]:
df_cc.to_csv(PROCESSED_DATA_PATH / 'cryptocurrency_close_values.csv')
df_bit.to_csv(PROCESSED_DATA_PATH / 'bitcoin_dataset.csv')
df_eth.to_csv(PROCESSED_DATA_PATH / 'ethereum_dataset.csv')