# In this notebook we continue to investigate "btrotta".
Here we walk through the code (reverse engineering "calculate_features.py").

* I want to summarize this first (from her .pdf)
* uses LightGBM (Light Gradient Boosting Machine, a gradient-boosted decision tree library)
* uses elementary operations to calculate features (no curve fitting - keeps runtime down)
  * Bayesian approach to removing noise
  * adding features based on scaled flux values
  * adding features to capture behavior around the peak
  * understanding how to optimize the metric
  
  training 
prediction


In [1]:
import pandas as pd
import numpy as np
import gc
import os


In [6]:
import datetime
print(datetime.datetime.now())
print("reading data")

2019-10-12 10:48:11.628547
reading data


In [7]:
# read data
col_dict = {'mjd': np.float64, 'flux': np.float32, 'flux_err': np.float32, 'object_id': np.int32, 'passband': np.int8,
            'detected': np.int8}
train_meta = pd.read_csv(os.path.join('data', 'training_set_metadata.csv'))
train = pd.read_csv(os.path.join('data', 'training_set.csv'), dtype=col_dict)


In [8]:
test_meta = pd.read_csv(os.path.join('data', 'test_set_metadata.csv'))
test = pd.read_csv(os.path.join('data', 'test_set_sample.csv'), dtype=col_dict)

In [9]:
all_meta = pd.concat([train_meta, test_meta], axis=0, ignore_index=True, sort=True).reset_index()
all_meta.drop('index', axis=1, inplace=True)

In [10]:
print(train_meta.shape)
print(train.shape)
print(test_meta.shape)
print(test.shape)
print(all_meta.shape)

(7848, 12)
(1421705, 6)
(3492890, 11)
(1000000, 6)
(3500738, 12)


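* hmmm, the "training" dataset has about 181 samples per object (1,421,705 rows / 7,848 objects)
* the "test" dataset appears to have only about 0.3 samples per object (1,000,000 rows / 3,492,890 objects in test_meta)
* that cannot be right, so I'm wondering how many objects are actually in the "test" timeseries (does the metadata list objects that are not in test?)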

In [11]:
train_meta.head()

Unnamed: 0,object_id,ra,decl,gal_l,gal_b,ddf,hostgal_specz,hostgal_photoz,hostgal_photoz_err,distmod,mwebv,target
0,615,349.046051,-61.943836,320.79653,-51.753706,1,0.0,0.0,0.0,,0.017,92
1,713,53.085938,-27.784405,223.525509,-54.460748,1,1.8181,1.6267,0.2552,45.4063,0.007,88
2,730,33.574219,-6.579593,170.455585,-61.548219,1,0.232,0.2262,0.0157,40.2561,0.021,42
3,745,0.189873,-45.586655,328.254458,-68.969298,1,0.3037,0.2813,1.1523,40.7951,0.007,90
4,1124,352.711273,-63.823658,316.922299,-51.059403,1,0.1934,0.2415,0.0176,40.4166,0.024,90


In [12]:
train_meta["object_id"].unique().shape

(7848,)

In [13]:
train["object_id"].unique().shape

(7848,)

In [14]:
test_meta["object_id"].unique().shape

(3492890,)

In [15]:
test["object_id"].unique().shape

(3036,)

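# so it is confirmed - the "test" timeseries dataset actually has 3036 objects
* test_meta has many object_ids that are not in the test timeseries data (the file read here is "test_set_sample.csv", a sample of the full test set)
* The test dataset has about 330 samples/object (1,000,000 rows / 3,036 objects)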
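
The check above can be condensed into a few lines. This is a minimal sketch on made-up toy tables (`meta` and `ts` are hypothetical stand-ins for the metadata and timeseries frames, not the real data): count the distinct objects that actually have timeseries rows, divide by that count rather than by the metadata length, and list the metadata ids that have no rows at all.

```python
import pandas as pd

# Toy stand-ins for the metadata and timeseries tables (values are made up).
meta = pd.DataFrame({"object_id": [1, 2, 3, 4]})
ts = pd.DataFrame({"object_id": [1, 1, 1, 3, 3],
                   "flux": [0.1, 0.2, 0.3, 1.0, 2.0]})

# Number of distinct objects that actually have timeseries rows.
n_objects_ts = ts["object_id"].nunique()

# Average samples per object, counting only objects present in the timeseries.
samples_per_object = len(ts) / n_objects_ts

# Metadata object_ids with no timeseries rows at all.
missing = meta.loc[~meta["object_id"].isin(ts["object_id"]), "object_id"].tolist()

print(n_objects_ts, samples_per_object, missing)
```

Dividing by the metadata length instead would give 5 / 4 here, which is the same kind of misleading ratio as the 0.3 samples/object seen earlier.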


# Feature Engineering (Section 1 in the .pdf) 
* in the code this is implemented as "calculate_features.py"

In [16]:
all_data = train.copy()

In [17]:

    # Normalise the flux, following the Bayesian approach here:
    # https://www.statlect.com/fundamentals-of-statistics/normal-distribution-Bayesian-estimation
    # Similar idea (but not the same) as the normalisation done in the Starter Kit
    # https://www.kaggle.com/michaelapers/the-plasticc-astronomy-starter-kit?scriptVersionId=6040398
    prior_mean = all_data.groupby(['object_id', 'passband'])['flux'].transform('mean')
    prior_std = all_data.groupby(['object_id', 'passband'])['flux'].transform('std')
    prior_std.loc[prior_std.isnull()] = all_data.loc[prior_std.isnull(), 'flux_err']
    obs_std = all_data['flux_err']  # since the above kernel tells us that the flux error is the 68% confidence interval
 

In [18]:
    all_data['bayes_flux'] = (all_data['flux'] / obs_std**2 + prior_mean / prior_std**2) \
                             / (1 / obs_std**2 + 1 / prior_std**2)
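
The expression above is a precision-weighted average: the observed flux is weighted by 1 / obs_std**2 and the per-band prior mean by 1 / prior_std**2. The sketch below restates it as a small standalone function (the name `bayes_flux` and the numbers are only for illustration) to show the effect on a noisy point versus a precise one.

```python
def bayes_flux(flux, obs_std, prior_mean, prior_std):
    """Precision-weighted average of an observation and a prior mean."""
    obs_precision = 1.0 / obs_std ** 2
    prior_precision = 1.0 / prior_std ** 2
    return ((flux * obs_precision + prior_mean * prior_precision)
            / (obs_precision + prior_precision))

# A noisy point (large flux_err) is pulled strongly toward the prior mean ...
noisy = bayes_flux(flux=100.0, obs_std=30.0, prior_mean=10.0, prior_std=10.0)

# ... while a precise point (small flux_err) stays close to its measured value.
precise = bayes_flux(flux=100.0, obs_std=1.0, prior_mean=10.0, prior_std=10.0)

print(noisy, precise)
```

In the noisy case the observation gets a weight of 0.1 and the prior a weight of 0.9, so the result lands much nearer to 10 than to 100.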

In [19]:
    all_data.loc[all_data['bayes_flux'].notnull(), 'flux'] \
        = all_data.loc[all_data['bayes_flux'].notnull(), 'bayes_flux']

In [20]:
    # Estimate the flux at source, using the fact that light is proportional
    # to inverse square of distance from source.
    # This is hinted at here: https://www.kaggle.com/c/PLAsTiCC-2018/discussion/70725#417195
    redshift = all_meta.set_index('object_id')[['hostgal_specz', 'hostgal_photoz']]
    
    redshift['redshift'] = redshift['hostgal_specz']
    redshift.loc[redshift['redshift'].isnull(), 'redshift'] \
        = redshift.loc[redshift['redshift'].isnull(), 'hostgal_photoz']

In [21]:
    all_data = pd.merge(all_data, redshift, 'left', 'object_id')
    nonzero_redshift = all_data['redshift'] > 0
    all_data.loc[nonzero_redshift, 'flux'] = all_data.loc[nonzero_redshift, 'flux'] \
                                             * all_data.loc[nonzero_redshift, 'redshift']**2
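
A toy version of the scaling step, assuming (as the code above does) that redshift can stand in for distance. The frames `obs` and `z` are invented for illustration. Rows with redshift 0 are left untouched, which matches the `nonzero_redshift` mask.

```python
import pandas as pd

# Made-up observations and redshifts; object 2 has redshift 0.
obs = pd.DataFrame({"object_id": [1, 1, 2], "flux": [10.0, 20.0, 5.0]})
z = pd.DataFrame({"object_id": [1, 2], "redshift": [0.5, 0.0]})

# Attach the redshift of each object to its observations.
obs = obs.merge(z, how="left", on="object_id")

# Multiply flux by redshift squared, but only where the redshift is positive.
nonzero = obs["redshift"] > 0
obs.loc[nonzero, "flux"] = obs.loc[nonzero, "flux"] * obs.loc[nonzero, "redshift"] ** 2

print(obs)
```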

In [22]:
all_data

Unnamed: 0,object_id,mjd,passband,flux,flux_err,detected,bayes_flux,hostgal_specz,hostgal_photoz,redshift
0,615,59750.4229,2,-544.784302,3.622952,1,-544.784302,0.0,0.0,0.0
1,615,59750.4306,1,-816.397644,5.553370,1,-816.397644,0.0,0.0,0.0
2,615,59750.4383,3,-471.340546,3.801213,1,-471.340546,0.0,0.0,0.0
3,615,59750.4450,4,-388.477905,11.395031,1,-388.477905,0.0,0.0,0.0
4,615,59752.4070,2,-681.815735,4.041204,1,-681.815735,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
1421700,130779836,60555.9838,4,-39.819046,46.477093,0,-39.819046,0.0,0.0,0.0
1421701,130779836,60560.0459,1,15.072200,18.947685,0,15.072200,0.0,0.0,0.0
1421702,130779836,60571.0225,5,30.733459,50.695290,0,30.733459,0.0,0.0,0.0
1421703,130779836,60585.9974,4,-23.413193,44.819859,0,-23.413193,0.0,0.0,0.0


In [23]:
    # aggregate features
    band_aggs = all_data.groupby(['object_id', 'passband'])['flux'].agg(['mean', 'std', 'max', 'min']).unstack(-1)
    band_aggs.columns = [x + '_' + str(y) for x in band_aggs.columns.levels[0]
                          for y in band_aggs.columns.levels[1]]
    all_data.sort_values(['object_id', 'passband', 'flux'], inplace=True)

In [24]:
band_aggs

object_id,mean_0,mean_1,mean_2,mean_3,mean_4,mean_5,std_0,std_1,std_2,std_3,...,max_2,max_3,max_4,max_5,min_0,min_1,min_2,min_3,min_4,min_5
615,-3.280875,-385.686066,-134.144547,-121.101494,-55.947735,-47.465115,83.769569,601.738953,455.090881,335.389160,...,611.929565,445.658356,381.876129,377.994080,-116.758644,-1100.351196,-681.815735,-530.595459,-422.112610,-422.530182
713,-9.234099,-3.436403,-2.644122,-3.357972,-3.113960,-6.039946,21.060951,17.820436,18.172571,20.005138,...,31.528498,33.711967,29.996479,23.188408,-44.869629,-38.005630,-32.812405,-39.487026,-37.800091,-36.686817
730,-0.002671,0.004307,0.128504,0.174134,0.232075,0.250948,0.041101,0.046259,0.281126,0.418759,...,1.095721,1.719861,2.127217,2.192224,-0.095862,-0.094754,-0.123092,-0.256831,-0.274332,-0.558706
745,0.157937,0.527264,0.895451,1.329056,1.227214,0.964163,0.320467,2.387648,2.941722,3.217043,...,20.322405,18.697315,16.823721,11.039402,-0.287152,-0.329730,-0.198360,-0.451884,-0.480905,-0.826227
1124,0.026324,0.171124,0.382707,0.414283,0.371966,0.255909,0.041623,0.285635,0.792699,0.976086,...,3.968275,5.194155,5.305183,3.692601,-0.080758,-0.074009,-0.076513,-0.101970,-0.489360,-0.368371
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130739978,1.422760,11.592505,1.599211,4.687986,-1.693461,19.014822,13.282038,41.883793,8.692912,4.975391,...,35.503822,18.029779,14.617792,430.005585,-19.813738,-3.701869,-5.716228,-3.713967,-21.628431,-82.479317
130755807,0.725015,0.205469,-0.021201,0.805642,1.222313,0.091906,2.244568,0.455759,0.052020,2.578640,...,0.050795,9.654325,10.642774,2.014629,-0.543207,-0.054628,-0.096152,-0.891813,-1.173656,-0.700877
130762946,0.031967,-19.712677,-28.527233,-14.739201,-15.416134,-10.772908,14.974207,15.471964,25.912512,25.454903,...,21.063087,38.645321,5.648650,49.731503,-36.691898,-61.140427,-72.320488,-60.707165,-31.057060,-57.803970
130772921,3.157643,28.301941,0.727620,-0.588218,-0.782045,8.299319,9.124571,97.320076,5.545393,2.962431,...,23.823160,7.349889,7.779712,153.784729,-7.293675,-5.249303,-5.461375,-5.472766,-8.767235,-32.409897


In [25]:
    # this way of calculating quantiles is faster than using the pandas quantile builtin on the groupby object
    all_data['group_count'] = all_data.groupby(['object_id', 'passband']).cumcount()
    all_data['group_size'] = all_data.groupby(['object_id', 'passband'])['flux'].transform('size')
    q_list = [0.25, 0.75]
    for q in q_list:
        all_data['q_' + str(q)] = all_data.loc[
            (all_data['group_size'] * q).astype(int) == all_data['group_count'], 'flux']
    quantiles = all_data.groupby(['object_id', 'passband'])[['q_' + str(q) for q in q_list]].max().unstack(-1)
    quantiles.columns = [str(x) + '_' + str(y) + '_quantile' for x in quantiles.columns.levels[0]
                         for y in quantiles.columns.levels[1]]
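
To see what the quantile trick does, here is a minimal sketch on invented data. After sorting by flux within each group, `cumcount()` is the rank of each row, so the row whose rank equals `int(group_size * q)` is an order statistic close to the q-quantile. Note that this picks an actual data point and does not interpolate, so it can differ slightly from `groupby(...).quantile(q)`.

```python
import pandas as pd

# Two (object, passband) groups with five made-up flux values each.
df = pd.DataFrame({
    "object_id": [1] * 10,
    "passband": [0] * 5 + [1] * 5,
    "flux": [5.0, 1.0, 4.0, 2.0, 3.0, 50.0, 10.0, 40.0, 20.0, 30.0],
})

# Sort so that cumcount() gives each row's rank within its group.
df = df.sort_values(["object_id", "passband", "flux"]).reset_index(drop=True)
grouped = df.groupby(["object_id", "passband"])
df["group_count"] = grouped.cumcount()
df["group_size"] = grouped["flux"].transform("size")

q = 0.75
# Keep flux only on the row whose rank equals int(size * q); other rows become NaN.
df["q_0.75"] = df.loc[(df["group_size"] * q).astype(int) == df["group_count"], "flux"]

# max() skips NaN, so it returns the single retained value per group.
approx_q = df.groupby(["object_id", "passband"])["q_0.75"].max()

print(approx_q)
```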


In [26]:
    # max detected flux
    max_detected = all_data.loc[all_data['detected'] == 1].groupby('object_id')['flux'].max().to_frame('max_detected')


In [27]:
max_detected

object_id,max_detected
615,660.555237
713,33.711967
730,2.192224
745,20.322405
1124,5.305183
...,...
130739978,430.005585
130755807,10.642774
130762946,-35.704021
130772921,321.631409


In [28]:
    def most_extreme(df_in, k, positive=True, suffix='', include_max=True, include_dur=True, include_interval=False):
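        # find the "most extreme" time for each object (the detected point furthest from its per-band median;
        # note that 'object_passband_mean' below actually holds the median), then for each band retrieve
        # the k data points on either side of that time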
        df = df_in.copy()
        df['object_passband_mean'] = df.groupby(['object_id', 'passband'])['flux'].transform('median')
        if positive:
            df['dist_from_mean'] = (df['flux'] - df['object_passband_mean'])
        else:
            df['dist_from_mean'] = -(df['flux'] - df['object_passband_mean'])

        max_time = df.loc[df['detected'] == 1].groupby('object_id')['dist_from_mean'].idxmax().to_frame(
            'max_ind')
        max_time['mjd_max' + suffix] = df.loc[max_time['max_ind'].values, 'mjd'].values
        df = pd.merge(df, max_time[['mjd_max' + suffix]], 'left', left_on=['object_id'], right_index=True)
        df['time_after_mjd_max'] = df['mjd'] - df['mjd_max' + suffix]
        df['time_before_mjd_max'] = -df['time_after_mjd_max']

        # first k after event
        df.sort_values(['object_id', 'passband', 'time_after_mjd_max'], inplace=True)
        df['row_num_after'] = df.loc[df['time_after_mjd_max'] >= 0].groupby(
            ['object_id', 'passband']).cumcount()
        first_k_after = df.loc[(df['row_num_after'] < k) & (df['time_after_mjd_max'] <= 50),
                              ['object_id', 'passband', 'flux', 'row_num_after']]
        first_k_after.set_index(['object_id', 'passband', 'row_num_after'], inplace=True)
        first_k_after = first_k_after.unstack(level=-1).unstack(level=-1)
        first_k_after.columns = [str(x) + '_' + str(y) + '_after' for x in first_k_after.columns.levels[1]
                                 for y in first_k_after.columns.levels[2]]
        extreme_data = first_k_after
        time_bands = [[-50, -20], [-20, -10], [-10, 0], [0, 10], [10, 20], [20, 50], [50, 100], [100, 200], [200, 500]]
        if include_interval:
            interval_arr = []
            for start, end in time_bands:
                band_data = df.loc[(start <= df['time_after_mjd_max']) & (df['time_after_mjd_max'] <= end)]
                interval_agg = band_data.groupby(['object_id', 'passband'])['flux'].mean().unstack(-1)
                interval_agg.columns = ['{}_start_{}_end_{}'.format(c, start, end) for c in interval_agg.columns]
                interval_arr.append(interval_agg)
            interval_data = pd.concat(interval_arr, axis=1)
            extreme_data = pd.concat([extreme_data, interval_data], axis=1)
        if include_dur:
            # detection duration in each passband after event
            duration_after = df.loc[(df['time_after_mjd_max'] >= 0) & (df['detected'] == 0)] \
                .groupby(['object_id', 'passband'])['time_after_mjd_max'].first().unstack(-1)
            duration_after.columns = ['dur_after_' + str(c) for c in range(6)]
            extreme_data = pd.concat([extreme_data, duration_after], axis=1)

        # last k before event
        df.sort_values(['object_id', 'passband', 'time_before_mjd_max'], inplace=True)
        df['row_num_before'] = df.loc[df['time_before_mjd_max'] >= 0].groupby(
            ['object_id', 'passband']).cumcount()
        first_k_before = df.loc[(df['row_num_before'] < k) & (df['time_after_mjd_max'] <= 50),
                                ['object_id', 'passband', 'flux', 'row_num_before']]
        first_k_before.set_index(['object_id', 'passband', 'row_num_before'], inplace=True)
        first_k_before = first_k_before.unstack(level=-1).unstack(level=-1)
        first_k_before.columns = [str(x) + '_' + str(y) + '_before' for x in first_k_before.columns.levels[1]
                                  for y in first_k_before.columns.levels[2]]
        extreme_data = pd.concat([extreme_data, first_k_before], axis=1)
        if include_dur:
            # detection duration in each passband before event
            duration_before = df.loc[(df['time_before_mjd_max'] >= 0) & (df['detected'] == 0)] \
                .groupby(['object_id', 'passband'])['time_before_mjd_max'].first().unstack(-1)
            duration_before.columns = ['dur_before_' + str(c) for c in range(6)]
            extreme_data = pd.concat([extreme_data, duration_before], axis=1)

        if include_max:
            # passband with maximum detected flux for each object
            max_pb = df.loc[max_time['max_ind'].values].groupby('object_id')['passband'].max().to_frame(
                'max_passband')
            # time of max in each passband, relative to extreme max
            band_max_ind = df.groupby(['object_id', 'passband'])['flux'].idxmax()
            band_mjd_max = df.loc[band_max_ind.values].groupby(['object_id', 'passband'])['mjd'].max().unstack(-1)
            cols = ['max_time_' + str(i) for i in range(6)]
            band_mjd_max.columns = cols
            band_mjd_max = pd.merge(band_mjd_max, max_time, 'left', 'object_id')
            for c in cols:
                band_mjd_max[c] -= band_mjd_max['mjd_max' + suffix]
            band_mjd_max.drop(['mjd_max' + suffix, 'max_ind'], axis=1, inplace=True)
            extreme_data = pd.concat([extreme_data, max_pb, band_mjd_max], axis=1)

        extreme_data.columns = [c + suffix for c in extreme_data.columns]
        return extreme_data

In [29]:
    extreme_max = most_extreme(all_data, 1, positive=True, suffix='', include_max=True, include_dur=True,
                               include_interval=True)

In [30]:
    extreme_min = most_extreme(all_data, 1, positive=False, suffix='_min', include_max=False, include_dur=True)
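
The core of `most_extreme` is locating one peak time per object and re-expressing every observation relative to it. The sketch below isolates just that step on a made-up light curve. It is a simplification for illustration, not a replacement for the function above.

```python
import pandas as pd

# One object, two passbands, invented values.
df = pd.DataFrame({
    "object_id": [1, 1, 1, 1],
    "passband": [0, 0, 1, 1],
    "mjd": [10.0, 20.0, 12.0, 22.0],
    "flux": [1.0, 9.0, 2.0, 3.0],
    "detected": [1, 1, 1, 1],
})

# Deviation of each point from the median of its (object, passband) light curve.
median = df.groupby(["object_id", "passband"])["flux"].transform("median")
df["dist"] = df["flux"] - median

# Row label of the largest detected deviation per object, then its timestamp.
peak_idx = df.loc[df["detected"] == 1].groupby("object_id")["dist"].idxmax()
peak_mjd = df.loc[peak_idx.values].set_index("object_id")["mjd"].rename("mjd_max")

# Attach the peak time to every row and express time relative to it.
df = df.merge(peak_mjd, how="left", left_on="object_id", right_index=True)
df["time_after_mjd_max"] = df["mjd"] - df["mjd_max"]

print(df[["passband", "mjd", "time_after_mjd_max"]])
```

The peak is found across all passbands together, so the other bands are measured against a time that may not be their own maximum. That is what the `max_time_*` features in the function capture.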


In [31]:
extreme_max.columns

Index(['0.0_0_after', '0.0_1_after', '0.0_2_after', '0.0_3_after',
       '0.0_4_after', '0.0_5_after', '0_start_-50_end_-20',
       '1_start_-50_end_-20', '2_start_-50_end_-20', '3_start_-50_end_-20',
       '4_start_-50_end_-20', '5_start_-50_end_-20', '0_start_-20_end_-10',
       '1_start_-20_end_-10', '2_start_-20_end_-10', '3_start_-20_end_-10',
       '4_start_-20_end_-10', '5_start_-20_end_-10', '0_start_-10_end_0',
       '1_start_-10_end_0', '2_start_-10_end_0', '3_start_-10_end_0',
       '4_start_-10_end_0', '5_start_-10_end_0', '0_start_0_end_10',
       '1_start_0_end_10', '2_start_0_end_10', '3_start_0_end_10',
       '4_start_0_end_10', '5_start_0_end_10', '0_start_10_end_20',
       '1_start_10_end_20', '2_start_10_end_20', '3_start_10_end_20',
       '4_start_10_end_20', '5_start_10_end_20', '0_start_20_end_50',
       '1_start_20_end_50', '2_start_20_end_50', '3_start_20_end_50',
       '4_start_20_end_50', '5_start_20_end_50', '0_start_50_end_100',
       '1_start_

In [32]:
    # add the feature mentioned here, attempts to identify periodicity:
    # https://www.kaggle.com/c/PLAsTiCC-2018/discussion/69696#410538
    time_between_detections = all_data.loc[all_data['detected'] == 1].groupby('object_id')['mjd'].agg(['max', 'min'])
    time_between_detections['det_period'] = time_between_detections['max'] - time_between_detections['min']
 

In [33]:
    # same feature but grouped by passband
    time_between_detections_pb \
        = all_data.loc[all_data['detected'] == 1].groupby(['object_id', 'passband'])['mjd'].agg(['max', 'min'])
    time_between_detections_pb['det_period'] = time_between_detections_pb['max'] - time_between_detections_pb['min']
    time_between_detections_pb = time_between_detections_pb['det_period'].unstack(-1)
    time_between_detections_pb.columns = ['det_period_pb_' + str(i) for i in range(6)]

In [34]:
    # similar feature based on high values
    all_data['threshold'] = all_data.groupby(['object_id'])['flux'].transform('max') * 0.75
    all_data['high'] = ((all_data['flux'] >= all_data['threshold']) & (all_data['detected'] == 1)).astype(int)
    time_between_highs = all_data.loc[all_data['high'] == 1].groupby('object_id')['mjd'].agg(['max', 'min'])
    time_between_highs['det_period_high'] = time_between_highs['max'] - time_between_highs['min']
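
A small sketch of the detection-span feature on invented data. Strictly speaking `det_period` is the time between the first and last detection rather than a period, and it is 0 for an object with a single detection.

```python
import pandas as pd

# Made-up observations: object 1 is detected twice, object 2 only once.
df = pd.DataFrame({
    "object_id": [1, 1, 1, 1, 2, 2, 2],
    "mjd": [100.0, 110.0, 150.0, 200.0, 100.0, 120.0, 130.0],
    "detected": [0, 1, 1, 0, 1, 0, 0],
})

# First and last detection time per object, and the span between them.
span = df.loc[df["detected"] == 1].groupby("object_id")["mjd"].agg(["max", "min"])
span["det_period"] = span["max"] - span["min"]

print(span)
```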

In [35]:
    # aggregate values of the features during the detection period
    all_data = pd.merge(all_data, time_between_detections, 'left', 'object_id')
    det_data = all_data.loc[(all_data['mjd'] >= all_data['min']) & (all_data['mjd'] <= all_data['max'])]
    det_aggs = det_data.groupby(['object_id', 'passband'])['flux'].agg(['min', 'max', 'std', 'median'])
    det_aggs['prop_detected'] = det_data.groupby(['object_id', 'passband'])['detected'].mean()
    det_aggs = det_aggs.unstack(-1)
    det_aggs.columns = [x + '_' + str(y) + '_det_period' for x in det_aggs.columns.levels[0]
                          for y in det_aggs.columns.levels[1]]

In [36]:
    # time distribution of detections in each band
    detection_time_dist \
        = all_data.loc[all_data['detected'] == 1].groupby(['object_id', 'passband'])['mjd'].std().unstack(-1)
    detection_time_dist.columns = ['time_dist_' + str(i) for i in range(6)]
    detection_time_dist_all \
        = all_data.loc[all_data['detected'] == 1].groupby(['object_id'])['mjd'].std().to_frame('time_dist')


In [37]:

    # scale data and recalculate band aggs
    all_data['abs_flux'] = all_data['flux'].abs()
    all_data['flux'] = (all_data['flux']) / all_data.groupby('object_id')['abs_flux'].transform('max')
    band_aggs_s = all_data.groupby(['object_id', 'passband'])['flux'].agg(['mean', 'std', 'max', 'min']).unstack(-1)
    band_aggs_s.columns = [x + '_' + str(y) + '_scaled' for x in band_aggs_s.columns.levels[0]
                          for y in band_aggs_s.columns.levels[1]]
    all_data.sort_values(['object_id', 'passband', 'flux'], inplace=True)
    for q in q_list:
        all_data['q_' + str(q)] = all_data.loc[
            (all_data['group_size'] * q).astype(int) == all_data['group_count'], 'flux']
    quantiles_s = all_data.groupby(['object_id', 'passband'])[['q_' + str(q) for q in q_list]].max().unstack(-1)
    quantiles_s.columns = [str(x) + '_' + str(y) + '_quantile_s' for x in quantiles_s.columns.levels[0]
                          for y in quantiles_s.columns.levels[1]]

    extreme_max_s = most_extreme(all_data, 1, positive=True, suffix='_s', include_max=False, include_dur=False,
                                 include_interval=True)
    extreme_min_s = most_extreme(all_data, 1, positive=False, suffix='_min_s', include_max=False, include_dur=False)
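
The scaling at the top of this cell divides each flux by the largest absolute flux of its object, so every light curve ends up within [-1, 1]. A minimal sketch with invented values (the `flux_scaled` column name is only for illustration; the code above overwrites `flux` in place):

```python
import pandas as pd

df = pd.DataFrame({"object_id": [1, 1, 2, 2],
                   "flux": [-4.0, 2.0, 10.0, 5.0]})

# Largest absolute flux per object, broadcast back to every row.
max_abs = df["flux"].abs().groupby(df["object_id"]).transform("max")

# Scaled flux keeps its sign and lies in [-1, 1].
df["flux_scaled"] = df["flux"] / max_abs

print(df)
```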


In [38]:

    new_data = pd.concat([band_aggs, quantiles, band_aggs_s, max_detected, time_between_detections[['det_period']],
                          time_between_detections_pb, extreme_max, extreme_min, extreme_max_s, extreme_min_s,
                          time_between_highs[['det_period_high']], quantiles_s, detection_time_dist,
                          detection_time_dist_all, det_aggs], axis=1)

In [40]:
new_data.shape

(7848, 305)

# thus there are 305 features
* in summary -
* start with "metadata" and "timeseries" data
* the "metadata" has one row per object_id, and is the destination for all generated features
* we generate features, mostly from the timeseries data
  * (coalesce the timeseries data into one row of features per object, to add to the metadata)
* in the end we go from 12 metadata columns (7848, 12) to 305 generated feature columns (7848, 305)
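
The overall pattern summarized above (aggregate the timeseries per object and passband, flatten to one row per object, join onto the metadata) can be sketched as follows. The tables are invented toy data and only two statistics are computed.

```python
import pandas as pd

# Toy metadata (one row per object) and timeseries (many rows per object).
meta = pd.DataFrame({"object_id": [1, 2], "ddf": [1, 0]})
ts = pd.DataFrame({
    "object_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "passband": [0, 0, 1, 1, 0, 0, 1, 1],
    "flux": [1.0, 3.0, 2.0, 4.0, 10.0, 30.0, 20.0, 40.0],
})

# Aggregate per (object, passband), then pivot passband into the columns.
aggs = ts.groupby(["object_id", "passband"])["flux"].agg(["mean", "max"]).unstack(-1)

# Flatten the (statistic, passband) column pairs into names like "mean_0".
aggs.columns = [f"{stat}_{band}" for stat, band in aggs.columns]

# One row per object: metadata columns plus the generated features.
features = meta.merge(aggs, how="left", left_on="object_id", right_index=True)

print(features)
```

Iterating over the column tuples directly, rather than over the cross product of the two levels as in the cells above, is a small design choice that avoids relying on the order of the levels.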

In [41]:
new_data.columns

Index(['mean_0', 'mean_1', 'mean_2', 'mean_3', 'mean_4', 'mean_5', 'std_0',
       'std_1', 'std_2', 'std_3',
       ...
       'median_2_det_period', 'median_3_det_period', 'median_4_det_period',
       'median_5_det_period', 'prop_detected_0_det_period',
       'prop_detected_1_det_period', 'prop_detected_2_det_period',
       'prop_detected_3_det_period', 'prop_detected_4_det_period',
       'prop_detected_5_det_period'],
      dtype='object', length=305)

# she then repeats this using "approx" data (specz vs photoz)

In [1]:
# process training set (not actually used, just to get right shape of dataframe)

In [2]:
# process test set (done in "chunks")

# Model (Section 2 in the .pdf)
* this is implemented in the code as "predict.py"


* **Gradient Boosted Classification Tree** (LightGBM)
* **Separate model for each class:**
  * (each class in the training data is either only galactic or only extra-galactic)
  * (galactic objects have hostgal_photoz = 0)
  * Thus -> she trains the models for galactic classes on galactic data, and the models for extra-galactic classes on extra-galactic data
* **Train separate models for "exact" and "approximate" redshift**
* **Test data is quite different from the training data ...**
  * To prevent overfitting, she used early stopping in LightGBM
  * the validation set is sampled from the training data, resampled so that its distribution reflects the test data

... having trouble importing lightgbm.  see:
https://github.com/Microsoft/LightGBM/issues/566

```
need to run python **64bit** not **32bit**
https://www.python.org/downloads/windows/

python 3.7.5 **64-bit**
+=================================
when running python 64bit ...

need to install scipy and scikit_learn using "wheel" files
(the "cp38" tag in a wheel name means CPython 3.8; the tag has to match the installed interpreter)

pip install C:\Users\Chris\Downloads\scipy-1.3.1-cp38-cp38-win_amd64.whl
pip install C:\Users\Chris\Downloads\scikit_learn-0.21.3-cp38-cp38-win_amd64.whl

References:
    https://stackoverflow.com/questions/26657334/installing-numpy-and-scipy-on-64-bit-windows-with-pip
    https://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy
    https://pip.pypa.io/en/latest/user_guide/#installing-from-wheels

+============
to script this... (to be done)
Invoke-WebRequest cmdlet
https://4sysops.com/archives/use-powershell-to-download-a-file-with-http-https-and-ftp/

```


In [2]:
import pandas as pd
import numpy as np
from sklearn import metrics, model_selection
import lightgbm as lgb
import os

In [None]:
# if test_mode is True, just run training and cross-validation on training data;
# if False, also make predictions on test set
test_mode = False

# read data
all_meta = pd.read_hdf(os.path.join('data', 'features', 'all_data.hdf5'), key='file0')
train_meta_approx = pd.read_hdf(os.path.join('data', 'features', 'train_meta_approx.hdf5'), key='file0')
train_meta_exact = pd.read_hdf(os.path.join('data', 'features', 'train_meta_exact.hdf5'), key='file0')
