# In this notebook we continue with "btrotta"

# Model (Section 2 in the .pdf)
* this is implemented in the code as "predict.py"


* **Gradient Boosted Classification Tree**
* **Separate model for each class:**
  * (each class in the Training data is either only galactic or only extra-galactic)
  * (galactic objects have hostgal_photoz = 0)
  * Thus she trains the models for galactic classes on galactic data, and the models for extra-galactic classes on extra-galactic data
* **Train separate models for "exact" and "approximate" redshift**
* **Test data is quite different than training ...**
  * To prevent overfitting, she used early stopping in LightGBM
  * The validation set is sampled from the training data, resampled so that its distribution reflects the test data
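
As a rough illustration of the two ideas above (splitting by galactic / extra-galactic, and resampling a validation set), here is a minimal sketch. It is not taken from predict.py. The toy table, the sampling weights, and the variable names are hypothetical; only the column names `hostgal_photoz` and `target` follow the competition metadata, and the split assumes galactic objects are the ones with `hostgal_photoz` equal to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy metadata (8 objects); real metadata has many more columns
train_meta = pd.DataFrame({
    'object_id': np.arange(8),
    'hostgal_photoz': [0.0, 0.0, 0.0, 0.3, 1.2, 0.0, 0.8, 2.1],
    'target': [6, 16, 65, 90, 42, 92, 15, 88],
})

# Assumption: hostgal_photoz == 0 marks a galactic object
is_galactic = train_meta['hostgal_photoz'] == 0
galactic = train_meta[is_galactic]
extragalactic = train_meta[~is_galactic]

# Hypothetical weights that favour higher-redshift rows, standing in for
# "resample so the validation set looks more like the test set"
weights = 1.0 + extragalactic['hostgal_photoz']
valid = extragalactic.sample(n=4, replace=True, weights=weights, random_state=0)

print(sorted(galactic['target']))
print(sorted(extragalactic['target']))
print(valid[['object_id', 'hostgal_photoz']])
```

Each subset would then get its own models, so a galactic model never has to consider extra-galactic classes and vice versa.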


... having trouble importing lightgbm.  see:
https://github.com/Microsoft/LightGBM/issues/566

```
need to run python **64bit** not **32bit**
https://www.python.org/downloads/windows/

python 3.7.5 **64-bit**
+=================================
when running python 64bit ...

need to install scipy and scikit_learn using "wheel" file

  pip install C:\Users\Chris\Downloads\scipy-1.3.1-cp38-cp38-win_amd64.whl
  pip install C:\Users\Chris\Downloads\scikit_learn-0.21.3-cp38-cp38-win_amd64.whl

References:
    https://stackoverflow.com/questions/26657334/installing-numpy-and-scipy-on-64-bit-windows-with-pip
    https://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy
    https://pip.pypa.io/en/latest/user_guide/#installing-from-wheels

+============
to script this... (to be done)
Invoke-WebRequest cmdlet
https://4sysops.com/archives/use-powershell-to-download-a-file-with-http-https-and-ftp/

```
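
Before reinstalling anything, it may help to confirm whether the running interpreter is 32-bit or 64-bit. A minimal check using only the standard library (the variable names are illustrative):

```python
import struct
import sys

# The size of a C pointer in bytes, times 8, gives the interpreter's bitness
bits = struct.calcsize('P') * 8

# An equivalent check: sys.maxsize only exceeds 2**32 on a 64-bit build
is_64bit = sys.maxsize > 2**32

print('Python %d.%d, %d-bit' % (sys.version_info.major, sys.version_info.minor, bits))
```

If this reports 32-bit, wheels tagged `win_amd64` will not install into that interpreter.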


In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics, model_selection
import lightgbm as lgb
import os

In [4]:
# if test_mode is True, just run training and cross-validation on training data;
# if False, also make predictions on test set
test_mode = False

# read data
all_meta = pd.read_hdf(os.path.join('data', 'features', 'all_data.hdf5'), key='file0')
train_meta_approx = pd.read_hdf(os.path.join('data', 'features', 'train_meta_approx.hdf5'), key='file0')
train_meta_exact = pd.read_hdf(os.path.join('data', 'features', 'train_meta_exact.hdf5'), key='file0')


FileNotFoundError: File data\features\all_data.hdf5 does not exist
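
The error says the feature files read above are not on disk, presumably because the feature-generation step has not yet been run in this working directory (an assumption; the notebook does not say). A small sketch that reports which of the three expected files are missing before attempting to read them:

```python
import os

# The three files the cell above tries to read
feature_dir = os.path.join('data', 'features')
expected = ['all_data.hdf5', 'train_meta_approx.hdf5', 'train_meta_exact.hdf5']
paths = [os.path.join(feature_dir, name) for name in expected]

# Collect the paths that do not exist on disk
missing = [p for p in paths if not os.path.exists(p)]

if missing:
    print('Missing feature files:')
    for p in missing:
        print('  ' + p)
else:
    print('All feature files found.')
```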

# But first a review on LightGBM...
* https://mlexplained.com/2018/01/05/lightgbm-and-xgboost-explained/  ***<--a "must-read"***
* https://lightgbm.readthedocs.io/en/latest/index.html

# Gradient Boosting Decision Trees (GBDT)
* **Common implementations:**
  * XGBoost (came into favor ~2015)
  * LightGBM (Microsoft, open-sourced ~2016)
  

* **Algorithm:**

  1. Create a decision tree ("Model 1")
     * at each node, choose the attribute to split on (e.g. by entropy or information gain)
     * build sub-nodes, repeating downward to build the tree
     * stop when some limit is reached:
       * no more attributes (exhaustive)
       * meet some criterion based on the loss function... f(class, regularization)
       * meet some criterion based on early stopping
     * post-pruning

  2. Create a modified version of the dataset
     * New Class = original dataset class - prediction from Model 1
     * ***i.e. the "New Class" is the "residual" (the error left behind by Model 1)***
     * we are building a model to predict the error...
     * `y = m1(x) + e`
       * where:
         * y is the class
         * m1(x) is m1's prediction of y
         * "e" is the error
     * solving for "e"
       * `e = y - m1(x)`

  3. Create a NEW decision tree ("Model 2")
     * same steps as above, but predicting the New Class variable

  4. **We now have 2 models**
     * The prediction of the first is good
     * The prediction of the second, **when added to the first,** ***is better***

  5. Repeat (building more models, each predicting the residual...)
     * until ...
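
The residual-fitting loop above can be sketched in a few lines. This is a toy illustration, not how LightGBM is implemented. It assumes a numeric target with squared-error loss (where the residual `e = y - m(x)` is exactly what each new tree should fit), uses one-split trees ("stumps") on a single feature, and adds a `learning_rate` shrinkage factor that is not in the notes above. All names here are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression tree to residuals r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds; right side is never empty
        left = x <= t
        left_mean, right_mean = r[left].mean(), r[~left].mean()
        sse = ((r[left] - left_mean) ** 2).sum() + ((r[~left] - right_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda q: np.where(q <= t, left_mean, right_mean)

# Toy data: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.zeros_like(y)  # the ensemble starts out predicting 0
models, errors = [], []
for _ in range(20):
    residual = y - prediction             # e = y - m(x)
    stump = fit_stump(x, residual)        # the next model predicts the residual
    models.append(stump)
    prediction = prediction + learning_rate * stump(x)
    errors.append(np.mean((y - prediction) ** 2))

print('training MSE after round 1: %.4f' % errors[0])
print('training MSE after round 20: %.4f' % errors[-1])
```

The training error can only go down as models are added, which is why a separate validation set is needed to decide when to stop.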
    



### Hyperparameters...
* boosting is subject to overfitting
* tuning is done using validation data
* the validation data can be an "out-of-bag" sample or come from cross-validation
* the best time to stop is when the validation error has decreased and begins to stabilize, before it starts to increase due to overfitting
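
The stopping rule in the last bullet can be written as a small helper. This is a sketch of the general "patience" idea, not LightGBM's own code; the function name, the `patience` parameter, and the validation errors below are all made up for illustration.

```python
def best_stopping_round(val_errors, patience=3):
    """Return the index of the best round seen before `patience`
    consecutive rounds pass without a new lowest validation error."""
    best_round, best_error = 0, float('inf')
    for i, err in enumerate(val_errors):
        if err < best_error:
            best_round, best_error = i, err
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop looking
    return best_round

# Hypothetical validation errors, one per boosting round
val_errors = [0.90, 0.70, 0.55, 0.50, 0.49, 0.50, 0.52, 0.56, 0.40]
stop = best_stopping_round(val_errors, patience=3)
print(stop, val_errors[stop])
```

With `patience=3` the search gives up at round 7 and keeps round 4, so the later dip to 0.40 is never reached. A larger `patience` trades extra training time for a lower chance of stopping too early.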




***LightGBM is a gradient boosting framework that uses tree based learning algorithms.***

(https://lightgbm.readthedocs.io/en/latest/Experiments.html)

***In LightGBM's own published experiments (linked above), it outperforms its peers:***

| | xgboost | xgboost_hist | LightGBM |
| --- | --- | --- | --- |
| Training time (relative to LightGBM; lower is faster) | 3x to 10x | 1.5x to 4x | 1x |
| Accuracy | | | 0 to 3% better |
| Memory (GB) | 1.5 to 6.2 | 1.4 to 5.0 | 0.8 to 1.0 |


In [69]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

In [71]:
train_data = lgb.Dataset('train.svm.bin')
print(type(train_data))

<class 'lightgbm.basic.Dataset'>


In [72]:
# load a numpy array into Dataset:
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)

In [73]:
print(data)

[[0.55379504 0.38774458 0.52724296 ... 0.84889923 0.1218365  0.88601886]
 [0.61361511 0.85954033 0.06751338 ... 0.45436543 0.63363454 0.0256262 ]
 [0.66805052 0.15698581 0.65909552 ... 0.9920577  0.59019986 0.92296939]
 ...
 [0.03583564 0.25896966 0.08475954 ... 0.34839232 0.42582798 0.10397257]
 [0.68507483 0.56182096 0.18288224 ... 0.29088559 0.0247573  0.70658587]
 [0.3269044  0.14248502 0.18087515 ... 0.94742193 0.14309186 0.6066268 ]]


In [60]:
#import scipy
#csr = scipy.sparse.csr_matrix((dat, (row, col)))
#train_data = lgb.Dataset(csr)

In [74]:
train_data.save_binary('train.bin')

<lightgbm.basic.Dataset at 0x240d96404c8>

In [83]:
validation_data = train_data.create_valid('validation.svm')
print(type(validation_data))

<class 'lightgbm.basic.Dataset'>


In [84]:
# feature_name needs one name per column (data has 10 columns)
train_data = lgb.Dataset(data, label=label, feature_name=['c%d' % i for i in range(1, 11)], categorical_feature=['c3'])

In [85]:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)

In [86]:
param = {'num_leaves': 31, 'objective': 'binary'}
param['metric'] = 'auc'


In [87]:
param['metric'] = ['auc', 'binary_logloss']

In [88]:
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])

LightGBMError: Cannot open data file validation.svm