# Data Wrangling
## Capstone Project One 
### C. Bonfield (Springboard Data Science Career Track)

For my first capstone project, I decided to try my hand at forecasting cryptocurrency closing prices using "traditional" cryptocurrency market data. This notebook details the "data wrangling" portion of the project: how I obtained my data (via the CryptoCompare API), how I cleaned it (by combining data pulled from multiple markets), and all of that good stuff!

### Collecting Data via the CryptoCompare API

To grab all of the data that I needed, I wrote a script that pulled data via the CryptoCompare API. I'll include the code below just for the sake of demonstration:

In [1]:
# Import statements
import json
import requests
import datetime
import numpy as np
import pandas as pd

In [2]:
def construct_url(params):
    """
    Build the URL associated with the call to the cryptocompare.com API.

    Input:
        params: (same as documented below in pull_data)

    Returns:
        url: URL for query
        sym_string: cryptocurrency identifier (just a label for later)
        exchange: (self-explanatory - also a label for later)
    """
    base_url = 'https://min-api.cryptocompare.com/data/histohour?'
    fsym, tsym = params['syms']
    agg = params['aggregate']
    lim = params['limit']
    exchange = params['exchange']

    ext_url = 'fsym=' + fsym + '&tsym=' + tsym + '&limit=' + lim + '&aggregate=' \
              + agg + '&e=' + exchange

    url = base_url + ext_url
    sym_string = fsym + '_' + tsym

    return url, sym_string, exchange
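
As a side note, the query string above is assembled by hand, which only works if every value is already a string. A minimal sketch of an alternative, assuming the same parameter dictionary, lets the standard library's `urlencode` build and escape the query. The name `construct_url_encoded` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://min-api.cryptocompare.com/data/histohour'


def construct_url_encoded(params):
    """Hypothetical variant of construct_url that delegates to urlencode."""
    fsym, tsym = params['syms']
    # urlencode converts non-string values with str() and escapes them.
    query = urlencode({
        'fsym': fsym,
        'tsym': tsym,
        'limit': params['limit'],
        'aggregate': params['aggregate'],
        'e': params['exchange'],
    })
    return BASE_URL + '?' + query, fsym + '_' + tsym, params['exchange']


url, sym_string, exchange = construct_url_encoded(
    {'syms': ('BTC', 'USD'), 'aggregate': 1, 'limit': 2000,
     'exchange': 'COINBASE'}
)
print(url)
```

With this variant, integers can be passed for limit and aggregate without triggering a TypeError during string concatenation.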

In [3]:
def dateparse(epoch_time):
    """
    Convert from epoch to human date (UTC).
    """
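    # Pass tz explicitly; without it, fromtimestamp returns local time, not UTC.
    return datetime.datetime.fromtimestamp(float(epoch_time),
                                           tz=datetime.timezone.utc)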
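
For converting a whole column of epoch seconds at once, pandas can do the work directly. The sketch below is only an illustration: the two epoch values are hypothetical hourly timestamps, not values read from the API.

```python
import pandas as pd

# Hypothetical epoch values (in seconds), one hour apart.
epochs = pd.Series([1509541200, 1509544800])

# unit='s' interprets the numbers as seconds; utc=True makes the result
# timezone-aware instead of naive.
stamps = pd.to_datetime(epochs, unit='s', utc=True)
print(stamps.iloc[0])
```

Keeping the timestamps timezone-aware avoids any ambiguity about whether an hour refers to UTC or to the local clock of the machine running the notebook.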

In [4]:
def pull_data(params):
    """
    Call API using url generated by construct_url, add a few additional
    columns as labels.

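    Inputs:
        params: dictionary of parameters to pass in the query (all values
                are strings, since construct_url concatenates them)
            syms: ('from' symbol, 'to' symbol) tuple; the 'from' symbol is
                  probably the cryptocurrency symbol, the 'to' symbol likely USD
            aggregate: aggregation period passed to the API ('1' here)
            limit: number of time points to return (max: 2000)
            exchange: exchange (Coinbase, Poloniex, etc. - refer to API
                      documentation for an exhaustive list)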

    Returns:
        data: data frame containing data from API call
    """

    url, s, e = construct_url(params)

    response = requests.get(url)
    response.raise_for_status()         # Raise exception if invalid response.
    json_response = response.json()

    data = pd.DataFrame(json_response['Data'])
    data['fsym_tsym'] = s
    data['exchange'] = e

    return data
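
To see what pull_data does to a response without touching the network, here is a small sketch that runs the same steps on a hand-written payload. The payload is an assumption: it only mimics the 'Data' list that the function reads, its prices are copied from the first row of the sample output shown further down, and the time value is a hypothetical epoch.

```python
import pandas as pd

# Hand-written stand-in for response.json(); not an actual API response.
payload = {
    'Data': [
        {'time': 1509541200, 'close': 6490.26, 'high': 6538.46,
         'low': 6483.34, 'open': 6517.35, 'volumefrom': 1051.74,
         'volumeto': 6844393.26},
    ]
}

# Same steps as pull_data: one row per record, then two label columns.
data = pd.DataFrame(payload['Data'])
data['fsym_tsym'] = 'BTC_USD'
data['exchange'] = 'COINBASE'
print(data.columns.tolist())
```

The two label columns are what later allow clean_data to group rows by cryptocurrency and average across exchanges.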

In [5]:
def clean_data(df):
    """
    Make data pulled from API neater. As written, this function simply averages
    features for each cryptocurrency across the five exchanges that I've
    chosen to use.

    Inputs:
        df: data from API (straight from CSV saved after double loop below)

    Returns:
        new_df: cleaned DF (exchange-averaged close, volume, fluctuation, and
                relative_hl_close for each cryptocurrency)
    """
    # Set index (on a new frame, so the caller's data is left untouched and
    # the function can safely be run more than once).
    df = df.set_index('time')

    # Treat missing values.
    df = df.replace(0.0, np.nan)

    # Construct new features.
    df['volume'] = df['volumeto'] - df['volumefrom']
    df['fluctuation'] = (df['high']-df['low']) / (df['open'])
    df['relative_hl_close'] = (df['close']-df['low']) / (df['high']-df['low'])

    # Select only relevant columns.
    sub_df = df[['close','volume','fluctuation','relative_hl_close',
                 'exchange','fsym_tsym']]

    # Average over exchanges for all features (mean skips NaN by default).
    features = ['close', 'volume', 'fluctuation', 'relative_hl_close']
    group_df = sub_df.groupby([sub_df.index, 'fsym_tsym'])[features].mean()

    # Construct hierarchical label for columns (feature, then fsym_tsym).
    new_df = group_df.unstack(level='fsym_tsym')

    return new_df

In [6]:
exchanges = ['COINBASE', 'POLONIEX', 'KRAKEN', 'BITSTAMP', 'BITFINEX']
sym_pairs = [('BTC','USD'),('ETH','USD'), ('LTC','USD'), ('DASH','USD'),
             ('XMR','USD')]

frames = [] # collect one data frame per (pair, exchange) query
for sp in sym_pairs:
    for exc in exchanges:
        request_dict = {'syms': sp, 'aggregate': '1', 'limit':'2000',
                        'exchange': exc}
        frames.append(pull_data(request_dict))

# Concatenate once at the end rather than growing the frame inside the loop.
full_df = pd.concat(frames, axis=0)

In [7]:
full_df.head()

| | close | exchange | fsym_tsym | high | low | open | time | volumefrom | volumeto |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6490.26 | COINBASE | BTC_USD | 6538.46 | 6483.34 | 6517.35 | 1509541000.0 | 1051.74 | 6844393.26 |
| 1 | 6568.29 | COINBASE | BTC_USD | 6569.88 | 6490.25 | 6490.26 | 1509545000.0 | 894.93 | 5853403.05 |
| 2 | 6564.0 | COINBASE | BTC_USD | 6572.71 | 6544.99 | 6568.29 | 1509548000.0 | 754.48 | 4947295.99 |
| 3 | 6595.0 | COINBASE | BTC_USD | 6650.84 | 6564.0 | 6564.0 | 1509552000.0 | 1524.77 | 10069906.84 |
| 4 | 6614.79 | COINBASE | BTC_USD | 6623.77 | 6582.74 | 6595.0 | 1509556000.0 | 1118.03 | 7384463.91 |


And so, we see that running the script results in a nice set of tidy data! 

### Cleaning the Data

Fortunately for me, I did not encounter any issues while working with the CryptoCompare API. I did, however, find that not all of the data I wanted was available at all times: for some cryptocurrencies, no market data was available for one reason or another (these gaps show up as zeros, which clean_data treats as missing). I sidestepped this issue by averaging whatever market data I had into a single set of features for each cryptocurrency/time pair.
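
The averaging step can be illustrated on a toy frame. This is a minimal sketch with made-up prices and made-up time labels, not data from the API: one exchange reports 0.0 for the second hour, the zero is turned into a missing value, and the mean then uses only the exchanges that did report.

```python
import numpy as np
import pandas as pd

# Made-up closing prices: KRAKEN reports 0.0 (missing) for the second hour.
raw = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'exchange': ['COINBASE', 'KRAKEN', 'COINBASE', 'KRAKEN'],
    'fsym_tsym': ['BTC_USD'] * 4,
    'close': [100.0, 102.0, 110.0, 0.0],
})

# Treat zeros as missing, then average over exchanges; mean() skips NaN.
raw['close'] = raw['close'].replace(0.0, np.nan)
avg = raw.groupby(['time', 'fsym_tsym'])['close'].mean().unstack('fsym_tsym')
print(avg)
```

For the first hour both exchanges contribute to the average; for the second hour the result is simply the one price that was available, rather than being dragged toward zero.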

In [8]:
x = clean_data(full_df)

In [9]:
x.head()

| time | close / BTC_USD | close / DASH_USD | close / ETH_USD | close / LTC_USD | close / XMR_USD | volume / BTC_USD | volume / DASH_USD | volume / ETH_USD | volume / LTC_USD | volume / XMR_USD | fluctuation / BTC_USD | fluctuation / DASH_USD | fluctuation / ETH_USD | fluctuation / LTC_USD | fluctuation / XMR_USD | relative_hl_close / BTC_USD | relative_hl_close / DASH_USD | relative_hl_close / ETH_USD | relative_hl_close / LTC_USD | relative_hl_close / XMR_USD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1509541000.0 | 6493.282 | 274.15 | 298.23 | 54.75 | 85.693333 | 6299566.0 | 27401.126667 | 743920.7 | 251964.38 | 60606.503333 | 0.009024 | 0.012793 | 0.009008 | 0.008644 | 0.017765 | 0.271723 | 0.402282 | 0.427831 | 0.564744 | 0.463134 |
| 1509545000.0 | 6561.012 | 274.963333 | 300.3925 | 54.792 | 86.606667 | 5071424.0 | 70037.386667 | 1539056.0 | 239492.052 | 14172.056667 | 0.011651 | 0.014183 | 0.013494 | 0.009708 | 0.017504 | 0.933634 | 0.609077 | 0.545585 | 0.399962 | 0.840476 |
| 1509548000.0 | 6566.308 | 275.83 | 298.1975 | 54.7 | 86.4 | 4019494.0 | 22462.096667 | 1007136.0 | 199908.51 | 41916.22 | 0.006794 | 0.007647 | 0.013325 | 0.006023 | 0.010291 | 0.865263 | 0.667684 | 0.177458 | 0.457189 | 0.457143 |
| 1509552000.0 | 6589.582 | 273.74 | 297.3875 | 54.434 | 85.62 | 11872270.0 | 44476.77 | 772462.3 | 237270.852 | 29892.243333 | 0.012411 | 0.012763 | 0.006956 | 0.007237 | 0.013901 | 0.471245 | 0.187217 | 0.443205 | 0.210128 | 0.201881 |
| 1509556000.0 | 6607.36 | 273.713333 | 296.1325 | 54.42 | 85.396667 | 4988790.0 | 107364.553333 | 936163.9 | 244408.626 | 74598.423333 | 0.007219 | 0.014172 | 0.007625 | 0.004663 | 0.012535 | 0.677336 | 0.615474 | 0.32372 | 0.473985 | 0.271481 |


* For each closing price, I chose to take an average of the closing prices for all available markets for each cryptocurrency. (The fact that cryptocurrency values can vary by market blew my mind, but I found that they did not vary much.)
* For volume, I took the difference of volumeto and volumefrom and averaged across exchanges. (In the raw data above, volumefrom is counted in coins and volumeto in USD, so this difference is dominated by the USD figure.) It would probably have been more natural to take a sum across exchanges here (to get a sense for how many coins were bought/sold in total), but since I was missing values here again, I figured an average would be the best that I could do.
* You may notice from the script that I have also added a few additional features into the mix. I did not know if they would be useful or not, but my thought was that I would try to inject additional features to capture the volatility of the market. Here's a brief description of each additional feature:
    - fluctuation: measure of volatility; defined as the difference of the hourly high and low prices over the opening price (averaged over exchanges)
    - relative_hl_close: additional measure of volatility; min-max scaling of closing price using high/low prices (averaged over exchanges). A value of 1 means that the closing price was the same as the high price in that hour across all exchanges, whereas a 0 occurs when it closes at the low price across all exchanges.
* I did not have to worry about culling the data any further (e.g., by looking for outliers) after pulling it from CryptoCompare - the API was very easy to use, and I did not see any funny business in the data that I grabbed.
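
As a quick worked example of the two volatility features, here is the arithmetic for a single hour on a single exchange, using the first row of the raw sample above (COINBASE, BTC_USD). Since these are single-exchange values, they are not expected to match the exchange-averaged numbers in the cleaned table.

```python
# Values taken from the first row of the raw sample (COINBASE, BTC_USD).
high, low, open_, close = 6538.46, 6483.34, 6517.35, 6490.26

# fluctuation: hourly high-low range relative to the opening price.
fluctuation = (high - low) / open_

# relative_hl_close: where the close sits between the low (0) and high (1).
relative_hl_close = (close - low) / (high - low)

print(round(fluctuation, 6), round(relative_hl_close, 6))
```

Here the close lands near the bottom of the hour's range, so relative_hl_close comes out much closer to 0 than to 1.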

### Conclusion

Overall, getting data for the project was not that difficult! There are a lot of (for-profit) services out there that charge a pretty penny for larger/more granular datasets, but I don't know that I need to bust out the big guns here.