## Description

This is the second of five notebooks documenting a pipelined approach to out-of-core computation using Dask and the Stochastic Gradient Descent classifier available in Scikit-learn. This notebook splits the raw Higgs data into two datasets: train and test. The training data will be used to fit a model, while the test set will be used to check how well the model generalizes. Fitting and testing the model happen in a separate notebook.

## Libraries

In [1]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

## Versions

In [2]:
items = [("Pandas", pd), ("Sklearn", sklearn)]
for name, module in items:
    print("{} version: {}".format(name, module.__version__))

Pandas version: 0.20.1
Sklearn version: 0.18.1


## Get Data
Make sure the **path** variable below is set correctly. Refer to the first notebook if you forgot where you saved the H5 file.

In [3]:
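path = '/Users/davidziganto/data/'
# Load the raw Higgs data saved in the first notebook
X = pd.read_hdf(path + 'raw_HIGGS_data.h5', key='/a')
# Separate the target column from the features (pop removes it from X)
y = X.pop('label')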

## Split Data

In [4]:
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stitch training features and training target variable back into single DF
training = pd.concat([X_train, y_train], axis=1)
training.reset_index(drop=True, inplace=True)

# Stitch test features and test target variable back into single DF
test = pd.concat([X_test, y_test], axis=1)
test.reset_index(drop=True, inplace=True)
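
The split-and-stitch pattern above can be sanity-checked on a small synthetic frame before running it on the full Higgs data. The sketch below is illustrative only: the `demo` DataFrame, its column names `f1`, `f2`, `f3`, and the 100-row size are assumptions made up for this example, not part of the Higgs dataset. It confirms that an 80/20 split yields the expected row counts and that concatenating features and labels along `axis=1` introduces no missing values, since both pieces share the same index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Higgs data: 100 rows, 3 features, binary label
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
demo["label"] = rng.randint(0, 2, size=100)

# Same steps as the notebook: separate the target, split, then stitch back together
demo_X = demo.drop("label", axis=1)
demo_y = demo["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_training = pd.concat([X_tr, y_tr], axis=1).reset_index(drop=True)
demo_test = pd.concat([X_te, y_te], axis=1).reset_index(drop=True)

# Row/column counts of each piece, and a check that the concat aligned cleanly
split_shapes = (demo_training.shape, demo_test.shape)
no_missing = not (
    demo_training.isnull().values.any() or demo_test.isnull().values.any()
)
print(split_shapes, no_missing)
```

The same two checks (shapes and absence of missing values) could be applied to `training` and `test` directly if memory allows.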

## Write Data To Disk
There is nothing to change here unless you want to save this data somewhere other than the path set above. In that case, remove **path +** from the code below and put the full destination path inside the quotes.

In [5]:
training.to_hdf(path + 'raw_HIGGS_training_data.h5',
                format='table',
                key='/a',
                mode='w',
                append=False, 
                complevel=9, 
                complib='blosc',
                fletcher32=True)

In [6]:
test.to_hdf(path + 'raw_HIGGS_test_data.h5',
            format='table',
            key='/a',
            mode='w',
            append=False, 
            complevel=9, 
            complib='blosc',
            fletcher32=True)
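
If you do save the files somewhere else, one option is to build the destination paths with `os.path.join` rather than string concatenation, which avoids problems with a missing trailing slash. This is a minimal sketch; `out_dir` and the `higgs_splits` folder name are hypothetical and should be replaced with your own location.

```python
import os

# Hypothetical output directory; replace with wherever you want the files saved
out_dir = os.path.join(os.path.expanduser("~"), "data", "higgs_splits")

# Full destination paths for the two datasets
training_file = os.path.join(out_dir, "raw_HIGGS_training_data.h5")
test_file = os.path.join(out_dir, "raw_HIGGS_test_data.h5")

print(training_file)
print(test_file)
```

`training_file` and `test_file` would then be passed as the first argument to `to_hdf` in place of `path + '...'`. The directory must exist before writing.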