---
## 2. Select subsets from our dataset


---

In [1]:
from digits.data import matimport
from digits.data import select

In [2]:
dataroot='../../data/thomas/artcorr/'
imp = matimport.Importer(dataroot=dataroot)

`imp.open()` gives us HDF5 references to the samples and targets datasets without loading them into memory up front.  
The `samples` and `targets` objects are attached to the `store` attribute.

In this notebook we will load the samples and targets from the file right away.

In [3]:
imp.open('3131.h5')
samples = imp.store.samples
targets = imp.store.targets

In [4]:
670*16

10720

In [5]:
print(select.getsessionnames(samples))
for sess in select.getsessionnames(samples):
    print(samples.xs(sess, level='session').shape[0])

['01', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16']
632
650
652
652
683
687
669
658
610
672
609
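The loop above calls `xs` once per session. Assuming `samples` is a pandas DataFrame whose row index has a level named `session` (as the `xs` call suggests), the same counts could be obtained in one pass with `groupby`. This is a sketch on a toy frame, not part of the `digits` package; the frame `toy` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `samples`: rows indexed by (subject, session, trial, presentation).
index = pd.MultiIndex.from_tuples(
    [("3131", "01", 1, 0), ("3131", "01", 1, 1), ("3131", "07", 1, 0)],
    names=["subject", "session", "trial", "presentation"],
)
toy = pd.DataFrame(np.zeros((3, 2)), index=index, columns=["a", "b"])

# Count rows per session in a single pass over the index.
rows_per_session = toy.groupby(level="session").size()
print(rows_per_session)
```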


The functions in `digits.data.select` provide a high-level abstraction for subselecting and pruning the large dataset, specific to the study's parameters. For instance:


#### column-wise

+ select only sampling points from a time window with `select.fromtimerange(samples, min, max)`
+ select all sampling points from a named list of channels with `select.fromchannellist(samples, list)`
+ select all sampling points from a range of channels with `select.fromchannelrange(samples, min, max)`
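As a rough illustration of what such column-wise selections might do, the sketch below filters a toy frame whose columns are a two-level MultiIndex (`channel`, `sample`). The helpers `by_channel_list` and `by_time_range` are hypothetical stand-ins written for this example; they are not the implementation of `select.fromchannellist` or `select.fromtimerange`. The time range is treated as inclusive on both ends, and plain string comparison is used, which only works because the labels are zero-padded.

```python
import numpy as np
import pandas as pd

channels = ["C1", "C2", "Cz"]
times = ["t_0000", "t_0001", "t_0002", "t_0003"]
columns = pd.MultiIndex.from_product([channels, times], names=["channel", "sample"])
toy = pd.DataFrame(np.arange(24.0).reshape(2, 12), columns=columns)

def by_channel_list(df, names):
    """Keep every column whose channel label is in `names` (hypothetical helper)."""
    keep = df.columns.get_level_values("channel").isin(names)
    return df.loc[:, keep]

def by_time_range(df, tmin, tmax):
    """Keep columns with tmin <= sample label <= tmax (hypothetical helper)."""
    labels = df.columns.get_level_values("sample")
    keep = [tmin <= label <= tmax for label in labels]
    return df.loc[:, keep]

subset = by_time_range(by_channel_list(toy, ["C1", "C2"]), "t_0001", "t_0002")
print(subset.shape)  # (2, 4): two channels times two time points
```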


#### row-wise

+ select all trials from a list of named session ids with `select.fromtrialid(samples, id-list)`
+ ...
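Row-wise selection has to keep samples and targets aligned, which is presumably why the session and presentation selectors used later in this notebook take and return both objects. A minimal sketch of that idea on toy data follows. `by_session_list` is a hypothetical helper, and it assumes that targets share the row index of samples.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["3131"], ["13", "14", "15"], [1, 2]],
    names=["subject", "session", "trial"],
)
toy_samples = pd.DataFrame(
    np.arange(12.0).reshape(6, 2), index=index, columns=["x0", "x1"]
)
toy_targets = pd.Series([0, 1, 0, 1, 0, 1], index=index, name="label")

def by_session_list(samples, targets, sessions):
    """Apply one boolean row mask to both objects so they stay aligned."""
    keep = samples.index.get_level_values("session").isin(sessions)
    return samples[keep], targets[keep]

sub_samples, sub_targets = by_session_list(toy_samples, toy_targets, ["14", "15"])
print(sub_samples.shape, sub_targets.shape)
```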


Some helper functions inside the `select` package help to get the names of the index/column labels.


The idea is to incrementally reduce the dataset to the desired size and/or programmatically loop over a number of blocks in the dataset with a sliding window analysis in mind.
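One way the sliding-window idea could look in code is sketched below, assuming the column layout described above (a `sample` level with zero-padded time labels). The generator `sliding_windows` and its `width` and `step` parameters are invented for this illustration and are not part of `digits.data.select`.

```python
import numpy as np
import pandas as pd

times = [f"t_{i:04d}" for i in range(6)]
columns = pd.MultiIndex.from_product([["C1", "C2"], times], names=["channel", "sample"])
toy = pd.DataFrame(np.zeros((3, 12)), columns=columns)

def sliding_windows(df, width, step):
    """Yield (first_label, last_label, block) for each full window of time points."""
    sample_level = df.columns.get_level_values("sample")
    labels = sorted(sample_level.unique())
    for start in range(0, len(labels) - width + 1, step):
        window = labels[start:start + width]
        # Keep the columns of all channels that fall inside this window.
        block = df.loc[:, sample_level.isin(window)]
        yield window[0], window[-1], block

windows = [(lo, hi, block.shape) for lo, hi, block in sliding_windows(toy, width=3, step=2)]
print(windows)
```

Each block keeps all rows and all channels, so a per-window analysis can be run on it directly.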

For example, the helper functions return:

In [6]:
print(select.getchannelnames(samples))
print(select.getsessionnames(samples))
print(select.getpresentationnames(samples))

['A1', 'AF3', 'AF4', 'AF7', 'AF8', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'CP1', 'CP2', 'CP3', 'CP4', 'CP5', 'CP6', 'CPz', 'Cz', 'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'FC1', 'FC2', 'FC3', 'FC4', 'FC5', 'FC6', 'FCz', 'FT7', 'FT8', 'Fp1', 'Fp2', 'Fpz', 'Fz', 'IOL', 'LHEOG', 'O1', 'O2', 'Oz', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'PO3', 'PO4', 'PO7', 'PO8', 'POz', 'Pz', 'RHEOG', 'T7', 'T8', 'TP7', 'TP8']
['01', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16']
['0', '1', '10', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '13', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '14', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '15', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '16', '160', '161', '162', '163', '164', '16

In [7]:
print(select.getsessionnames(samples))

['01', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16']


The level/index names can be displayed quite nicely with `head()`:

In [8]:
samples.head()

| subject | session | trial | presentation | A1 t_0000 | A1 t_0001 | A1 t_0002 | A1 t_0003 | A1 t_0004 | A1 t_0005 | A1 t_0006 | A1 t_0007 | A1 t_0008 | A1 t_0009 | ... | TP8 t_1391 | TP8 t_1392 | TP8 t_1393 | TP8 t_1394 | TP8 t_1395 | TP8 t_1396 | TP8 t_1397 | TP8 t_1398 | TP8 t_1399 | TP8 t_1400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3131 | 1 | 1 | 0 | 1.689615 | -2.453192 | -8.765578 | -11.972319 | -9.849177 | -7.114474 | -9.006877 | -13.935571 | -16.840109 | -16.823076 | ... | 15.935822 | 13.320987 | 9.83316 | 5.308144 | 0.260753 | -3.780688 | -5.544036 | -5.443503 | -5.029633 | -4.876393 |
| 3131 | 1 | 1 | 1 | 4.328066 | 6.830514 | 2.813455 | -2.613245 | -3.789852 | -0.751709 | 2.737862 | 4.809949 | 6.287606 | 6.930873 | ... | 4.3338 | 3.705289 | 7.250734 | 10.510907 | 10.063163 | 6.407798 | 2.393856 | 0.804607 | 2.572249 | 6.067993 |
| 3131 | 1 | 1 | 2 | 8.778225 | 5.560692 | 1.972173 | 0.300447 | 0.200939 | 1.036754 | 1.537859 | 0.578754 | -0.983312 | -1.90757 | ... | 7.796444 | -7.895495 | -16.413845 | -11.682496 | -2.190552 | 1.083942 | -1.305389 | -2.26005 | -1.327338 | -4.151634 |
| 3131 | 1 | 1 | 3 | -0.492122 | 1.264945 | 5.40327 | 9.796667 | 13.128324 | 13.470269 | 8.742441 | 1.281598 | -2.315708 | 0.684538 | ... | -21.384834 | -15.984956 | -6.571479 | -1.409841 | -4.616758 | -11.22372 | -12.282315 | -3.693588 | 9.363037 | 15.343504 |
| 3131 | 1 | 2 | 4 | -0.221068 | 0.586229 | 1.735736 | 0.739493 | -3.061715 | -6.123262 | -6.32695 | -5.859096 | -6.358163 | -6.213858 | ... | 12.679294 | 12.537173 | 12.354441 | 13.66256 | 15.598869 | 16.17363 | 13.845768 | 8.383247 | 1.906764 | -1.338092 |


Now for the selection:

In [9]:
print(samples.shape)
print(select.getsessionnames(samples))

(7174, 89664)
['01', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16']


In [10]:
samples, targets = select.fromsessionlist(samples, targets, ['14', '15'])
samples.shape

(1282, 89664)

In [11]:
samples = select.fromchannellist(samples, ['C1', 'C2'])
print(samples.shape)

(1282, 2802)


In [12]:
samples = select.fromtimerange(samples, 't_0200', 't_0201')
print(samples.shape)

(1282, 4)
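The shapes printed in the cells above can be cross-checked by hand from the earlier outputs: rows are trials, columns are channel and time-point pairs. The short check below copies the session row counts from `In [5]`, counts 64 entries in the channel list, and assumes the time labels run from `t_0000` to `t_1400` inclusive, as the `head()` output suggests.

```python
# Row counts per session, copied from the output of In [5].
rows_per_session = {
    "01": 632, "07": 650, "08": 652, "09": 652, "10": 683, "11": 687,
    "12": 669, "13": 658, "14": 610, "15": 672, "16": 609,
}
n_channels = 64       # length of the channel list printed in In [6]
n_timepoints = 1401   # t_0000 .. t_1400 inclusive (assumed from head())

total_rows = sum(rows_per_session.values())
rows_14_15 = rows_per_session["14"] + rows_per_session["15"]

assert (total_rows, n_channels * n_timepoints) == (7174, 89664)  # In [9]
assert (rows_14_15, n_channels * n_timepoints) == (1282, 89664)  # In [10]
assert (rows_14_15, 2 * n_timepoints) == (1282, 2802)            # In [11]
assert (rows_14_15, 2 * 2) == (1282, 4)                          # In [12]
print("all shapes consistent")
```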


In [13]:
samples, targets = select.frompresentationlist(samples, targets, ['1','2','3','4'])

In [14]:
samples.head(10)

| subject | session | trial | presentation | C1 t_0200 | C1 t_0201 | C2 t_0200 | C2 t_0201 |
|---|---|---|---|---|---|---|---|
| 3131 | 14 | 2 | 1 | -7.291202 | -8.3487 | -10.118226 | -11.385602 |
| 3131 | 14 | 2 | 2 | 5.475969 | 9.075162 | 9.528195 | 12.702423 |
| 3131 | 14 | 2 | 3 | 13.696177 | 14.089633 | 20.607351 | 20.646475 |
| 3131 | 14 | 2 | 4 | 3.592678 | 3.01983 | 5.601668 | 5.647057 |
| 3131 | 15 | 1 | 1 | -0.402161 | -0.086583 | -2.192876 | -2.454573 |
| 3131 | 15 | 1 | 2 | -9.592463 | -10.213227 | -16.365511 | -16.116913 |
| 3131 | 15 | 2 | 3 | -4.315301 | -3.262175 | -3.72487 | -2.534175 |
| 3131 | 15 | 2 | 4 | 3.713143 | 5.164251 | 7.71698 | 9.143118 |


In [15]:
print(samples.head(10).to_latex())

\begin{tabular}{llllrrrr}
\toprule
     &    &   &   &         C1 &            &         C2 &            \\
     &    &   &   &     t\_0200 &     t\_0201 &     t\_0200 &     t\_0201 \\
subject & session & trial & presentation &            &            &            &            \\
\midrule
3131 & 14 & 2 & 1 &  -7.291202 &  -8.348700 & -10.118226 & -11.385602 \\
     &    &   & 2 &   5.475969 &   9.075162 &   9.528195 &  12.702423 \\
     &    &   & 3 &  13.696177 &  14.089633 &  20.607351 &  20.646475 \\
     &    &   & 4 &   3.592678 &   3.019830 &   5.601668 &   5.647057 \\
     & 15 & 1 & 1 &  -0.402161 &  -0.086583 &  -2.192876 &  -2.454573 \\
     &    &   & 2 &  -9.592463 & -10.213227 & -16.365511 & -16.116913 \\
     &    & 2 & 3 &  -4.315301 &  -3.262175 &  -3.724870 &  -2.534175 \\
     &    &   & 4 &   3.713143 &   5.164251 &   7.716980 &   9.143118 \\
\bottomrule
\end{tabular}



See the pandas documentation on cross-sections with `xs`: http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs
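For readers unfamiliar with `xs`, here is a small self-contained sketch on a toy MultiIndex frame (names and values are made up). By default `xs` drops the level it selects on. Passing `drop_level=False` keeps it, which can matter when later steps still need the `session` label.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["3131"], ["14", "15"], [1, 2]], names=["subject", "session", "trial"]
)
toy = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["x0", "x1"])

# Cross-section for one session: the 'session' level is dropped from the result.
one_session = toy.xs("14", level="session")
print(list(one_session.index.names))

# Same rows, but the 'session' level is kept in the index.
kept = toy.xs("14", level="session", drop_level=False)
print(list(kept.index.names))
```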