# Staging Crypto Orderbook Features for Training With IV Labs

This is a brief example notebook demonstrating how to add and index new data sets from your local machine using the IV Labs client.

First let's import the appropriate libraries and check if we have any feature servers up and running to try adding data to:

In [3]:
import sys
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import json
import os
from feature_server.historical.task import get_running_feature_server_urls

# Let's print out the feature servers that are running
feature_server_urls = get_running_feature_server_urls()
print(feature_server_urls)



['af605267f3a1111e8b3210ab26c3cd17-1311831915.us-east-1.elb.amazonaws.com']




### Creating a new server if necessary
If no feature server is running (that is, the command above returns no valid server addresses), run a feature server task and wait for the server to set up and be assigned an IP. This step can take some time, particularly the IP assignment if you happen to be using Azure. You can check the status of the external IP assignment with `kubectl get services` and the status of the server itself with `kubectl get pods`.

In [None]:
# If no feature server URL is listed above, spin one up by running this cell
from feature_server.historical.task import FeatureServerTask
f = FeatureServerTask()
f.run()
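
Waiting for the new server to receive its address can be scripted rather than checked by hand. The following is a minimal sketch of a polling loop, not part of the IV Labs client: `wait_for_urls` is a hypothetical helper, and it assumes that the URL lookup returns an empty list while no server is reachable.

```python
import time


def wait_for_urls(get_urls, timeout=600, interval=10, sleep=time.sleep):
    """Poll get_urls() until it returns a non-empty list.

    get_urls is any zero-argument callable that returns a list of URLs.
    Raises RuntimeError once at least `timeout` seconds have been waited.
    """
    waited = 0
    while True:
        urls = get_urls()
        if urls:
            return urls
        if waited >= timeout:
            raise RuntimeError(
                'no feature server URL after {} seconds'.format(waited))
        sleep(interval)
        waited += interval
```

In this notebook the call would presumably be `feature_server_urls = wait_for_urls(get_running_feature_server_urls)`, which also refreshes the URL list used in the next step.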

### Creating a client to verify connection
Once the feature server has been assigned a URL, you can create a feature server client:

In [4]:
from feature_server.historical.client import Client
# Once we can see a feature server running in the URL list, let's connect to it
feature_server_client = Client(feature_server_urls[0])

Now that we have our feature server running and our client attached, we can register a new dataset on the server.

To do this we need to give the data set a unique identifier (which must match the SQL specification for table names, e.g. no dashes or special characters), a fixed schema for the columns in the data set (in line with the standard [SQL Schema](https://www.postgresql.org/docs/9.5/static/datatype.html)), and finally the order of the columns.

For now only structured relational data can be registered and indexed by the feature server but we hope to change that soon.
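
The naming rule for the data set identifier can be checked before registering. The sketch below assumes the feature server follows PostgreSQL's rules for unquoted identifiers (a letter or underscore first, then letters, digits, underscores, or dollar signs, with at most 63 bytes significant). `is_safe_table_name` is a hypothetical helper, it is deliberately conservative (ASCII only), and it does not check for reserved keywords.

```python
import re

# Conservative pattern for an unquoted PostgreSQL identifier (ASCII only).
_IDENTIFIER = re.compile(r'^[A-Za-z_][A-Za-z0-9_$]*$')


def is_safe_table_name(name):
    """Return True if name looks like a valid unquoted table name."""
    return len(name) <= 63 and _IDENTIFIER.match(name) is not None
```

For example, `is_safe_table_name('signprocess')` is true, while `is_safe_table_name('sign-process')` is false because of the dash.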

In [1]:
feature_headers = ['time','sign','dp','c_csm_3','j_csm_3','h_csm_3','s_csm_3_dummy_5','s_csm_3_dummy_6','s_csm_3_dummy_7',
 's_csm_3_dummy_10','s_csm_3_dummy_14','s_csm_3_dummy_15','s_csm_3_dummy_16','s_csm_3_dummy_17','s_csm_3_dummy_18',
 's_csm_3_dummy_19','s_csm_3_dummy_20','s_csm_3_dummy_21','s_csm_3_dummy_22','s_csm_3_dummy_23','s_csm_3_dummy_24',
 's_csm_3_dummy_25','s_csm_3_dummy_26','s_csm_3_dummy_27','s_csm_3_dummy_28','s_csm_3_dummy_29','s_csm_3_dummy_30',
 's_csm_3_dummy_31','s_csm_3_dummy_32','sign_2','sign_3','sign_4','sign_5','sign_6','sign_7','sign_8','sign_9',
 'sign_10','dp_lag_2','dp_lag_3','dp_lag_4','dp_lag_5','dp_lag_6','dp_lag_7','dp_lag_8','dp_lag_9','dp_lag_10',
 'j_csm_3_lag_2','c_csm_3_lag_2','h_csm_3_lag_2','j_csm_3_lag_3','c_csm_3_lag_3','h_csm_3_lag_3','j_csm_3_lag_4',
 'c_csm_3_lag_4','h_csm_3_lag_4','j_csm_3_lag_5','c_csm_3_lag_5','h_csm_3_lag_5','j_csm_3_lag_6','c_csm_3_lag_6',
 'h_csm_3_lag_6','j_csm_3_lag_7','c_csm_3_lag_7','h_csm_3_lag_7','j_csm_3_lag_8','c_csm_3_lag_8','h_csm_3_lag_8',
 'j_csm_3_lag_9','c_csm_3_lag_9','h_csm_3_lag_9','j_csm_3_lag_10','c_csm_3_lag_10','h_csm_3_lag_10',
 'sign_-1_L_10_count','sign_-1_L_10_repetition','sign_0_L_10_count','sign_0_L_10_repetition','sign_1_L_10_count',
 'sign_1_L_10_repetition']

In [None]:
feature_types = {x: 'double precision' for x in feature_headers}
feature_types['sign'] = 'int'
feature_types['time'] = 'varchar(255)'

In [None]:
feature_server_client.register_dataset('signprocess',  
                                       feature_types,
                                       feature_headers)
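
Because the schema and the column order are passed separately, it can be worth confirming that they describe the same set of columns before calling `register_dataset`. This is a small sketch of such a check; `check_schema` is a hypothetical helper and not part of the client.

```python
def check_schema(headers, types):
    """Return a list of problems; an empty list means headers and types agree."""
    problems = []
    seen = set()
    for name in headers:
        if name in seen:
            problems.append('duplicate column: {}'.format(name))
        seen.add(name)
        if name not in types:
            problems.append('no type for column: {}'.format(name))
    for name in types:
        if name not in seen:
            problems.append('type given for unknown column: {}'.format(name))
    return problems
```

With the variables defined above, `check_schema(feature_headers, feature_types)` should return an empty list.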

Now that we have registered the data set, we can stream data from a local file and append it to the data set in the feature server using the `add_to_dataset` method. For now we only support CSV files. When adding data to the feature server, make sure the columns of these files are in the same order you indicated when registering the data set above. Also be sure to correctly indicate whether or not the CSV file contains a header.

In [7]:
from common.file_system.file_system import fs

files = fs.list_files('features/v0/time_bars_3s', delimiter='')

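for f in files:
    fs.download_file(f, 'dump/')
    name = f.split('/')[-1]
    # Assumes download_file saves each file under dump/ with its original name
    file_path = os.path.join('dump', name)
    feature_server_client.add_to_dataset('signprocess', file_path, file_type='csv', header=True)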

{u'success': True}
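
Since the column order of each file has to match the registered order, a quick local check of the header row before uploading can catch mismatches early. The following sketch is written for Python 3 and uses only the standard `csv` module; `header_matches` is a hypothetical helper, and it only applies to files that actually contain a header row.

```python
import csv


def header_matches(csv_path, expected_headers):
    """Return True if the first row of the CSV equals expected_headers."""
    with open(csv_path, newline='') as handle:
        # next(..., None) yields None for an empty file instead of raising
        first_row = next(csv.reader(handle), None)
    return first_row == list(expected_headers)
```

In the loop above, a file could be skipped or reported whenever `header_matches(file_path, feature_headers)` is false.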

Once you run this, the feature data should also be available as flat files in the dump/ directory, which is created in the same folder as this notebook. If you want to model on this data outside of IV Labs, you can download it directly from this JupyterHub interface.

It's important to remember that the `add_to_dataset` method always appends new data to the existing data set on the feature server.

Now that we have created the data set on the feature server and added some custom data, let's ensure everything has been indexed and loaded correctly by printing the number of rows indexed (should be 4 if you just loaded example.csv) as well as the schema.

In [9]:
# Let's check the dataset schema
print(feature_server_client.query_for_dataset_schema('signprocess'))

# Let's see how many rows are in the data set
print(feature_server_client.query_for_dataset_size('signprocess'))

[[u't', u'integer'], [u'name', u'character varying'], [u'age', u'integer'], [u'gpa', u'double precision']]
4
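
To know what row count to expect, the data rows in the local files can be counted and compared with the size reported by the server. This is a sketch for Python 3 using the standard `csv` module; `count_data_rows` is a hypothetical helper, and the comparison only holds if each file was added exactly once, since every call to `add_to_dataset` appends.

```python
import csv


def count_data_rows(csv_paths, header=True):
    """Count rows across CSV files, excluding one header row per file if header is True."""
    total = 0
    for path in csv_paths:
        with open(path, newline='') as handle:
            rows = sum(1 for _ in csv.reader(handle))
        total += max(rows - 1, 0) if header else rows
    return total
```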


Now we can keep adding data from other files with the same column structure to this data set, or start loading it into arbitrary jobs running from IV Labs.