This notebook explores the data, builds a model, and validates that model using the training data set.

In [1]:
#import libraries
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#%matplotlib inline

#read data
data = pd.read_csv('./numerai_training_data.csv')

#size and type, data
print('dimensions:')
print(data.shape)
print('rows:')
nrows=data.shape[0]
print(nrows)
print('columns:')
ncols=data.shape[1]
print(ncols) 

dimensions:
(535713, 54)
rows:
535713
columns:
54


In [2]:
#preview data
data.head()

Unnamed: 0,id,era,data_type,feature1,feature2,feature3,feature4,feature5,feature6,feature7,...,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50,target
0,nb5059fbc40534a1,era1,train,0.49282,0.58077,0.48948,0.56762,0.56107,0.51168,0.47459,...,0.40333,0.52337,0.60795,0.35748,0.49677,0.28295,0.65342,0.57915,0.51136,1
1,nb2bb43f474f2429,era1,train,0.53427,0.72712,0.61895,0.66238,0.47163,0.66996,0.37673,...,0.46847,0.44557,0.64831,0.28245,0.68554,0.26547,0.59322,0.53156,0.61621,0
2,n1e960207daad44a,era1,train,0.54888,0.46304,0.49582,0.52395,0.57362,0.46969,0.50229,...,0.74065,0.41003,0.4323,0.78286,0.52214,0.43961,0.46139,0.61272,0.72566,1
3,n5e99b4326e6f463,era1,train,0.64488,0.56167,0.72591,0.52219,0.49311,0.51511,0.45514,...,0.67751,0.4334,0.67009,0.50086,0.51208,0.58674,0.54358,0.58602,0.51818,0
4,nf454131816e5401,era1,train,0.45235,0.56569,0.54424,0.34145,0.67652,0.44318,0.45627,...,0.40709,0.58624,0.44531,0.66276,0.41992,0.58741,0.62276,0.31212,0.22357,0


The preview shows that every row carries a target value alongside its features, so this is a labeled training set and the task is a supervised learning problem.

In [3]:
#data types
data.dtypes

id            object
era           object
data_type     object
feature1     float64
feature2     float64
feature3     float64
feature4     float64
feature5     float64
feature6     float64
feature7     float64
feature8     float64
feature9     float64
feature10    float64
feature11    float64
feature12    float64
feature13    float64
feature14    float64
feature15    float64
feature16    float64
feature17    float64
feature18    float64
feature19    float64
feature20    float64
feature21    float64
feature22    float64
feature23    float64
feature24    float64
feature25    float64
feature26    float64
feature27    float64
feature28    float64
feature29    float64
feature30    float64
feature31    float64
feature32    float64
feature33    float64
feature34    float64
feature35    float64
feature36    float64
feature37    float64
feature38    float64
feature39    float64
feature40    float64
feature41    float64
feature42    float64
feature43    float64
feature44    float64
feature45    

There are 50 features, all of type float64. The target is an integer (int64). The three remaining columns (id, era, data_type) are of type object.
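The split between feature columns, the target, and the three descriptive columns can also be derived from the column names and dtypes instead of being read off the listing. The following is a minimal sketch on a small hypothetical stand-in frame (`toy`, with 3 features instead of 50 and made-up values); it assumes, as in the listing above, that every model input is named `featureN`.

```python
import pandas as pd

# Hypothetical stand-in for the training data: same column layout,
# but only 3 features and 4 rows of made-up values.
toy = pd.DataFrame({
    "id": ["n1", "n2", "n3", "n4"],
    "era": ["era1", "era1", "era2", "era2"],
    "data_type": ["train"] * 4,
    "feature1": [0.49, 0.53, 0.55, 0.64],
    "feature2": [0.58, 0.73, 0.46, 0.56],
    "feature3": [0.49, 0.62, 0.50, 0.73],
    "target": [1, 0, 1, 0],
})

# Assumption: every model input column is named "feature<N>".
feature_cols = [c for c in toy.columns if c.startswith("feature")]

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Separate model inputs (X) from the label (y).
X = toy[feature_cols]
y = toy["target"]

print(feature_cols)
print(dtype_counts)
```

On the real data the same lines would be applied to `data` instead of `toy`.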

In [4]:
#check uniqueness of values, check for missing values and nans

#id
print('ID')
print('unique values:')
print(data.id.unique())
print('number of unique values:')
print(data.id.nunique())
print('number of rows in dataset:')
print(nrows)

ID
unique values:
['nb5059fbc40534a1' 'nb2bb43f474f2429' 'n1e960207daad44a' ...,
 'nc74ef5d1017b4f9' 'n47e612126c9b405' 'ncd4237ff29cb422']
number of unique values:
535713
number of rows in dataset:
535713


Every row has a unique ID: the number of unique IDs (535713) equals the number of rows in the dataset.
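The output above confirms uniqueness but says nothing about missing values or NaNs. A minimal sketch of how both checks could be written is shown below; it runs on a tiny hypothetical frame (`toy`) with one NaN planted on purpose, so the result for the real data is not implied here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; one NaN is inserted deliberately
# so that the missing-value check has something to find.
toy = pd.DataFrame({
    "id": ["n1", "n2", "n3", "n4"],
    "feature1": [0.49, np.nan, 0.55, 0.64],
    "target": [1, 0, 1, 0],
})

# True when no ID value is repeated.
ids_unique = toy["id"].is_unique

# Number of missing values (NaN/None) in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]

print(ids_unique)
print(columns_with_missing)
```

Running the same three checks on `data` would show whether any column of the training set needs imputation or filtering.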

In [12]:
#era
print('ERA')
print('unique values:')
print(data.era.unique())
print('number of unique values:')
print(data.era.nunique())
print('number of rows in dataset:')
print(nrows)
print('number of value counts:')
print(data.era.value_counts())

ERA
unique values:
['era1' 'era2' 'era3' 'era4' 'era5' 'era6' 'era7' 'era8' 'era9' 'era10'
 'era11' 'era12' 'era13' 'era14' 'era15' 'era16' 'era17' 'era18' 'era19'
 'era20' 'era21' 'era22' 'era23' 'era24' 'era25' 'era26' 'era27' 'era28'
 'era29' 'era30' 'era31' 'era32' 'era33' 'era34' 'era35' 'era36' 'era37'
 'era38' 'era39' 'era40' 'era41' 'era42' 'era43' 'era44' 'era45' 'era46'
 'era47' 'era48' 'era49' 'era50' 'era51' 'era52' 'era53' 'era54' 'era55'
 'era56' 'era57' 'era58' 'era59' 'era60' 'era61' 'era62' 'era63' 'era64'
 'era65' 'era66' 'era67' 'era68' 'era69' 'era70' 'era71' 'era72' 'era73'
 'era74' 'era75' 'era76' 'era77' 'era78' 'era79' 'era80' 'era81' 'era82'
 'era83' 'era84' 'era85']
number of unique values:
85
number of rows in dataset:
535713
number of value counts:
era34    6793
era25    6757
era24    6749
era31    6745
era26    6743
era32    6734
era27    6717
era33    6712
era28    6703
era29    6685
era30    6679
era23    6570
era36    6548
era35    6548
era71    6473
era

The data are grouped into 85 distinct eras. The number of observations per era ranges from 5927 to 6793.
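The smallest and largest era sizes can be computed directly instead of being read from the printed counts. The sketch below uses a hypothetical `toy` frame with three eras; the numeric sort key assumes era labels have the form `era<N>`, as in the output above.

```python
import pandas as pd

# Hypothetical stand-in: three eras with 3, 5 and 4 rows.
toy = pd.DataFrame({"era": ["era1"] * 3 + ["era2"] * 5 + ["era10"] * 4})

# Rows per era, sorted by count (largest first).
era_sizes = toy["era"].value_counts()

# Extremes of the per-era row counts.
smallest, largest = era_sizes.min(), era_sizes.max()

# Order eras by their numeric suffix so that "era10" follows "era2"
# (assumes every label looks like "era<N>").
era_order = sorted(era_sizes.index, key=lambda e: int(e[3:]))

print(smallest, largest)
print(era_sizes.loc[era_order])
```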

In [13]:
#data type
print('DATA TYPE')
print('unique values:')
print(data.data_type.unique())
print('number of unique values:')
print(data.data_type.nunique())

DATA TYPE
unique values:
['train']
number of unique values:
1


Every row has data_type 'train', so the file contains only training data.

In [11]:
#target
print('TARGET')
print('unique values:')
print(data.target.unique())
print('number of unique values:')
print(data.target.nunique())
print('number of value counts:')
print(data.target.value_counts())

TARGET
unique values:
[1 0]
number of unique values:
2
number of value counts:
0    267878
1    267835
Name: target, dtype: int64


There are 2 distinct target values, 0 and 1. Because the target is discrete with two classes, this is a binary classification problem. The classes are almost perfectly balanced: 267878 rows have target 0 and 267835 have target 1, a difference of only 43 rows.
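Class balance is easier to judge as proportions than as raw counts. The following sketch shows one way to compute them with `value_counts(normalize=True)`; the `target` series here is a hypothetical eight-row example, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for the target column: four 0s and four 1s.
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1], name="target")

# Share of rows belonging to each class (values sum to 1).
class_share = target.value_counts(normalize=True)

# Absolute difference between the two class shares; 0 means perfectly balanced.
imbalance = abs(class_share.loc[0] - class_share.loc[1])

print(class_share)
print(imbalance)
```

On the real data, `data.target.value_counts(normalize=True)` would give the corresponding shares.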

In [7]:
#features
#distributions of each of the 50 features
data.feature1.describe()

count    535713.000000
mean          0.472921
std           0.113607
min           0.000000
25%           0.392630
50%           0.467900
75%           0.548840
max           0.982560
Name: feature1, dtype: float64

In [8]:
data.feature2.describe()

count    535713.000000
mean          0.482357
std           0.117309
min           0.000000
25%           0.401900
50%           0.481960
75%           0.561510
max           1.000000
Name: feature2, dtype: float64
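Calling `describe()` one feature at a time becomes tedious for 50 columns. A minimal sketch of summarising all features in one call follows; it uses a hypothetical `toy` frame of random values in [0, 1] with 3 features, and assumes the feature columns are named `featureN`. The range check mirrors what the two summaries above suggest (minimum 0, maximum at most 1) but does not establish it for the remaining features.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 100 rows, 3 features drawn uniformly from [0, 1).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.uniform(0, 1, size=(100, 3)),
    columns=["feature1", "feature2", "feature3"],
)

# Assumption: every model input column is named "feature<N>".
feature_cols = [c for c in toy.columns if c.startswith("feature")]

# One row per feature, one column per statistic (count, mean, std, ...).
summary = toy[feature_cols].describe().T

# Check whether every feature stays inside the unit interval.
in_unit_interval = bool(((summary["min"] >= 0) & (summary["max"] <= 1)).all())

print(summary)
print(in_unit_interval)
```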