**README**: Information about notebook

This starter notebook, which will also be provided to you on competition day, contains boilerplate code to load all of the data (both the sample data and the data for each case as it goes live). Because the notebook is mainly meant to help you get acquainted with loading and manipulating the data, you can currently access the sample data and the data from any case. On competition day, however, you will only be able to access the sample data and the data from the case that is currently live. If you prefer a different statistical analysis framework, you can download zip files containing the same data from the website http://18.216.4.171:8080/ and write similar boilerplate code yourself. On competition day, the sample data and the data for each case will be available there as downloadable zip files as each case goes live.
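
If you go the zip-file route, the loading step might look roughly like the sketch below. The archive layout is not documented here, so this assumes (hypothetically) that the zip holds one CSV file per signal, such as `A.csv`. The helper name `load_csvs_from_zip` is made up for illustration, and the demo builds a tiny in-memory archive instead of downloading anything.

```python
import io
import zipfile

import pandas as pd


def load_csvs_from_zip(zip_source):
    """Return {file stem: DataFrame} for every .csv member of a zip archive.

    ASSUMPTION: one CSV per signal inside the archive; the real layout
    of the competition zip files may differ.
    """
    frames = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.lower().endswith('.csv'):
                # Drop any folder prefix and the '.csv' suffix
                stem = member.rsplit('/', 1)[-1][:-4]
                with archive.open(member) as handle:
                    frames[stem] = pd.read_csv(handle)
    return frames


# Demo with a small in-memory archive; the member name is hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('A.csv', 'time_step,price\n1000,109.39\n1001,109.88\n')
buffer.seek(0)

frames = load_csvs_from_zip(buffer)
print(frames['A'].shape)
```

For a real download, you would pass the path of the saved zip file instead of `buffer`.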

In [46]:
import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO
    
# Make sure you have the following libraries installed.
# Can be installed with `pip install` or `conda install`
# on terminal if you have pip or conda on your machine.
import pandas as pd
import requests

In [47]:
# This cell defines the server URLs and the helper functions that build request URLs
base_url = 'http://18.216.4.171:8080'
import_url = '/data/'
sample_import_url = '/public/sample_data/'
submit_url = '/submissions/new'
def get_import_url(case_number, signal_name):
    return base_url + import_url + str(case_number) + '/' + signal_name

def get_sample_import_url(signal_name):
    return base_url + sample_import_url + signal_name + '.csv'

def get_submit_url():
    return base_url + submit_url

def get_signal_list():
    return ['A', 'B', 'C', 'D', 'E']

In [52]:
#This class helps load data (sample data and case data) 
#and also allows you to submit prediction intervals programmatically
class DayOf(object):
    
    def __init__(self, credentials):
        '''
        Initializes object with your team credentials
        Will not be able to submit data to our server if a valid
        set of credentials is not passed to this constructor
        '''
        self.team_credentials = credentials

    def load_data(self, case_number):
        '''
        Use this function to import data for case_number
        Note: you can only import data for a case while that case is active
        Alternatively, you can visit the URL given by
            `get_import_url(case_number, signal_name)`
            and directly download the data for that case and signal as a csv file
        Returns a dict with keys corresponding to signal names and values 
            as a pandas DataFrame
        The fields of each pandas DataFrame are
            time_step     (contains time step index,            integer)
            bid_size      (total number of bids,                float  )
            ask_size      (total number of asks,                float  )
            bid_exec      (number of executions at bid price,   float  )
            ask_exec      (number of executions at ask price,   float  )
            spread        (spread between bid/ask prices,       float  ) 
            price         (contains stock price,                float  ) 
        '''
        result_data = {}
        for signal_name in get_signal_list():
            request_data = requests.get(get_import_url(case_number, signal_name))
            if not request_data.status_code == 200:
                print('Error: cannot load data for ' + 
                      str(case_number) + signal_name + ' at this time')
                return result_data
            loaded_request_data = StringIO(request_data.text)
            result_data[signal_name] = pd.read_csv(loaded_request_data)

        return result_data
    
    def submit_data(self, case_number, signal_name, lower_bound, upper_bound):
        '''
        Parameters should be self-explanatory
        Use this method to submit your guesses to our server
        You can also do this directly on our website at `base_url`
        Will raise an error if your data wasn't submitted
        '''
        submission_data = {
            'team_credentials': self.team_credentials,
            'submission_for':   case_number,
            'signal':           signal_name,
            'lower_bound':      lower_bound,
            'upper_bound':      upper_bound
        }
        request_data = requests.post(get_submit_url(), submission_data)
        request_datadict = request_data.json()
        if not request_datadict['status'] == 'success':
            raise RuntimeError(request_datadict['reason'])
        

In [53]:
# Initialize a DayOf instance as D
D = DayOf('Insert your team credentials here')
# On competition day, insert the valid credentials that will be emailed to you
# if you want to submit your intervals programmatically.
# You do not need valid credentials just to retrieve data.

Running the cell below will populate sample_data with the sample data on our server so you can build your models. For format information, see DayOf.load_data above. Note that for practice purposes, prices are not correlated with the features at all (so don't be surprised if all of your modeling attempts fail)! The point of this notebook is to help you get familiarized with getting/manipulating the data and submitting predictions.

In [54]:
sample_data = {}
for signal_name in get_signal_list():
    request_data = requests.get(get_sample_import_url(signal_name))
    if not request_data.status_code == 200:
        print('Error: cannot load data for ' + 
              signal_name + ' at this time')
        continue
    loaded_request_data = StringIO(request_data.text)
    sample_data[signal_name] = pd.read_csv(loaded_request_data)
    # You can convert this pandas dataframe into a numpy array
    # with sample_data[signal_name].values

To load the data for **case number** 1 and store it in `case1_data`, run
```python
case1_data = D.load_data(1)
```

To submit **lower bound** = 5.1 and **upper bound** = 5.9 for **case number** 1 and **stock name** A, do
```python
D.submit_data(1, 'A', 5.1, 5.9)
```
Multiple submissions are acceptable. We will only consider your latest submission within the time limit of the case.
You can also submit your answers through the web interface at `http://18.216.4.171:8080/`.
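
As a rough illustration of where the two bounds might come from, the sketch below builds an interval around the last observed price using the spread of past log returns. This is only one possible heuristic, not the organizers' scoring rule or a recommended method. The function `naive_interval` and its `width` parameter are hypothetical names introduced here, and the prices are the first five values printed for stock A in this notebook.

```python
import math
import statistics


def naive_interval(prices, width=2.0):
    """Return (lower, upper) around the last price.

    ASSUMPTION: the next log return is roughly within `width` sample
    standard deviations of zero. This is an illustrative heuristic only.
    """
    log_returns = [math.log(b) - math.log(a)
                   for a, b in zip(prices[:-1], prices[1:])]
    sigma = statistics.stdev(log_returns)
    last = prices[-1]
    lower = last * math.exp(-width * sigma)
    upper = last * math.exp(width * sigma)
    return round(lower, 2), round(upper, 2)


prices = [109.39, 109.88, 109.54, 109.94, 109.89]
lower, upper = naive_interval(prices)
print(lower, upper)

# With a DayOf instance D holding valid credentials, one might then call:
# D.submit_data(1, 'A', lower, upper)
```

Checking that `lower < upper` before submitting is a cheap way to catch swapped arguments.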

**PRACTICE ONLY**

Currently, you will be able to access all of the data from all of the cases (each distinct 5 minute period is referred to as a case). On competition day, you will only be able to access the sample data and data from the currently active case. Allowing you to view data from all the different cases will allow you to familiarize yourself with the format for the case data and submission format.

In [64]:
case1_data = D.load_data(1)
#there are 50 ticks of data for each stock, of which only 49 contain information
#notice that the last tick contains no values for the features and for the price of the stock,
#as the price is what you're trying to predict
print("Head for stock A")
print (case1_data['A'].head()) #show the top of the dataframe containing data for stock A in case 1
print("Tail for stock A")
print (case1_data['A'].tail()) #show the bottom of the dataframe containing data for stock A in case 1

Head for stock A
   time_step  bid_size  ask_size  bid_exec  ask_exec  spread   price
0       1000   95748.0   51629.0    6063.0    4331.0    0.06  109.39
1       1001  120294.0   42059.0   25921.0    1307.0    0.06  109.88
2       1002   93781.0   43366.0     579.0    4797.0    0.06  109.54
3       1003   90865.0   37192.0   11788.0    2756.0    0.07  109.94
4       1004   83042.0   44859.0    1477.0      95.0    0.07  109.89
Tail for stock A
    time_step  bid_size  ask_size  bid_exec  ask_exec  spread   price
45       1045   83648.0   85935.0    7773.0   14873.0    0.06  111.21
46       1046   91629.0   70959.0    2592.0    2456.0    0.08  110.98
47       1047   94729.0   63259.0   11533.0     205.0    0.07  111.27
48       1048   95408.0   61754.0    7134.0    9932.0    0.09  111.60
49       1049       NaN       NaN       NaN       NaN     NaN     NaN
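
Because the last row holds only NaN values, it can help to separate the observed rows from the row you need to predict before doing any modeling. The sketch below shows one way to do that with pandas. The helper name `split_observed_and_target` is hypothetical, and the small frame is a made-up stand-in that mimics the layout printed above with fewer columns.

```python
import numpy as np
import pandas as pd


def split_observed_and_target(df):
    """Split a case DataFrame into rows with a known price and rows without one."""
    observed = df.dropna(subset=['price'])
    target = df[df['price'].isna()]
    return observed, target


# Tiny stand-in frame; a real case frame has 50 rows and all seven columns.
example = pd.DataFrame({
    'time_step': [1000, 1001, 1002],
    'bid_size': [95748.0, 120294.0, np.nan],
    'price': [109.39, 109.88, np.nan],
})

observed, target = split_observed_and_target(example)
print(len(observed), len(target))
print(target['time_step'].tolist())
```

With real case data you would call `split_observed_and_target(case1_data['A'])` instead.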


An example of something one might do with the data is given below:

Say we're trying to train a model for stock A: we might regress the log return from time t to time t+1 on the features at time t. With this data, the prices are randomly generated so that the log returns are normally distributed, which means there will be no correlation between the log returns and the signals. In general, though, you might expect relationships like these to hold.

In [79]:
import numpy as np
A_prices = sample_data['A']['price'].values
A_log_returns = np.log(A_prices[1:]) - np.log(A_prices[:-1]) 
feature_cols = sample_data['A'].columns[1:6] #only columns 1-5 inclusive have relevant features
A_features = sample_data['A'][feature_cols].values[:-1] #don't want to include last row in features

from sklearn.linear_model import LinearRegression #may need to install this
# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
model = LinearRegression()
X_train = A_features[:-10] #features
y_train = A_log_returns[:-10] #what we want to predict
X_test = A_features[-10:]
y_test = A_log_returns[-10:]
model.fit(X_train, y_train)
print(model.score(X_test, y_test)) #print r^2 of predicting on X_test with true values
print('Wow our r^2 sucks because we\'re trying to predict random noise!')

0.0693524123371
Wow our r^2 sucks because we're trying to predict random noise!
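
To make the printed score easier to interpret, here is a minimal sketch of the quantity that `model.score` reports for a regressor: the coefficient of determination, 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The function name `r_squared` and the toy numbers are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


y_true = [0.01, -0.02, 0.03, 0.00]                   # made-up log returns
perfect = r_squared(y_true, y_true)                  # predictions match exactly
baseline = r_squared(y_true, [np.mean(y_true)] * 4)  # always predict the mean
worse = r_squared(y_true, [0.05] * 4)                # a poor constant guess
print(perfect, baseline, worse)
```

A model that always predicts the mean scores zero, and a model that does worse than that scores below zero, so a value near zero on held-out data is what you would expect when the target is noise.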
