# kaggle_plasticc

This Jupyter Notebook explores the Kaggle PLAsTiCC dataset and competition. The competition and dataset can be found at https://www.kaggle.com/c/PLAsTiCC-2018 

This notebook can be found at  https://github.com/cwinsor/kaggle_plasticc

PLAsTiCC is the Kaggle competition for LSST (Large Synoptic Survey Telescope).  The telescope, under construction on a remote mountaintop in Chile, will capture altogether new data from the night skies, and analyze that data using altogether new techniques.

LSST performs **"time domain astronomy"**, which means it is looking for short-term events called "transients". There are dozens of known classes of transients - single-occurrence events like novae and micro-lensing, and periodic events like eclipsing binaries. Micro-lensing is one of the ways astronomers discover planets orbiting other stars - like earth around another sun!

LSST is vast in its ability to observe. LSST will capture the entire southern hemisphere every 4 nights searching for transients. **LSST researchers expect there are about 1M transient events every night.**

Finding those events in the vast quantities of data requires new techniques in classification.  LSST is exploring new boundaries in automated classification, and that is the goal of PLAsTiCC.  **The goal for LSST is to flag a one-time event (such as the start of a nova) within 60 minutes of observation** and notify other astronomers so they can start closer observation. That's the reason for PLAsTiCC - to establish tools for classification that can handle data on this scale.

That is Pretty Cool Stuff!


# The Plan
The steps in my investigation of PLAsTiCC are:

1. Review the "starter Kit" https://www.kaggle.com/michaelapers/the-plasticc-astronomy-starter-kit
2. Review example from the host https://www.kaggle.com/michaelapers/the-plasticc-astronomy-classification-demo *(note I moved my notes from this to the bottom - it is not consistent with the leaderboard techniques)*
3. ***Review leaderboard kernels (from Discussion Board)***
4. Review the "Info For the Challenge" from hosts - describing the Performance Metric


# The example from host
The example starts by using the Cesium library to "featurize" the time-series data.

So we take a quick dive into Cesium, a library to "featurize" time-series data. A good example:
http://cesium-ml.org/docs/auto_examples/plot_EEG_Example.html#sphx-glr-auto-examples-plot-eeg-example-py 

The example also uses a library to establish the period of a repeating signal that is sampled infrequently.
*(interesting stuff, and very useful - need to revisit this)*

The example then reduces the number of features and evaluates using correlation, PCA, and a confusion matrix.
*(more work needed here)*

## Reviewing Leaderboard Models (posted on Discussion Board)...
https://www.kaggle.com/c/PLAsTiCC-2018/discussion

## "20th Place Solution" (Giba)
https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75262#latest-527064
* Summary:
  * Stack ensemble and a blend of 4x LGB and 2x SVC models trained using different subsets of features.
* Features:
  * tried > 1000, pruned down to 250
  * aggregations and statistics, feets lib, coefficients of linear...
* Models:
  * for each - augmented train set w/ noise (?)
  * Model 1 - multi-class LGB
  * Model 2 - multi-class LGB
  * Model 3 - multi-class LGB
  * Model 4 - multi-*label* LGB
  * Model 5 - multi-*label* SVC
  * Model 6 - multi-class LGB
  * Model 7 - multi-*label* LGB
  * Stack Ensemble 1
  * Stack Ensemble 2
  * Stack Ensemble 3
  * Blend Stack Ensemble 1,2,3
  * Class 99 - used "Scirpus" equation 
* What Didn't work:
  * ...
  
  

# "9th place solution" (Albert Garreta)
* https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75316#latest-495584
* In Summary:
  * careful feature engineering and ensembling
  * stacking predictions of lgb, catboost, and nn, then a weighted average with previous best submission
* First-level Models
  * a single lgb (teammates pursued other)
* Features
  * 8000 features
  * prune to (80 + 130) using LGB
    * @manugangler's kernel fitting light curves to microlensing events
    * distmod/log10(hostgal_photoz)
    * log_10( luminosity_mean )
    * used different sets of time observations (all, detected, undetected)
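
The "weighted average with previous best submission" step can be pictured with a minimal sketch. The probability vectors and the 0.6/0.4 weights below are made-up illustrations, not values from the write-up, and `blend` is a hypothetical helper name.

```python
def blend(prob_sets, weights):
    """Weighted average of per-class probability vectors, renormalized to sum to 1."""
    total_weight = sum(weights)
    n_classes = len(prob_sets[0])
    mixed = [
        sum(w * probs[k] for w, probs in zip(weights, prob_sets)) / total_weight
        for k in range(n_classes)
    ]
    norm = sum(mixed)
    return [p / norm for p in mixed]


# Hypothetical predictions for one object over three classes
stacked_model = [0.70, 0.20, 0.10]
previous_best = [0.50, 0.40, 0.10]

blended = blend([stacked_model, previous_best], weights=[0.6, 0.4])
print(blended)
```

Averaging probabilities (rather than hard labels) keeps the output usable for a log-loss style metric.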
    
   

# "2nd-Place Solution Notes" (Silogram)
* https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75059#latest-462457
* In Summary
  * Neural net (details in discussion)
  * meta-model as GBDT model(depth = 2)
* Notes:
  * The Gap: they did not have the "LB vs CV" problem (presumably leaderboard score vs. cross-validation score)
  * Class 99 - left to the end
  * Final ensemble shallow - depth=2
* Feature Engineering:
   * "Detected" flag
   * "hostgal_specz"
   * flux adjustment
   * MJD adjustment
* Augmentation - tried a variety - helped NN but not LGB

# What is "LGB" ?
* "LGB" is shorthand for LightGBM ("Light Gradient Boosting Machine"), a gradient boosting library
* https://en.wikipedia.org/wiki/Gradient_boosting
* In summary
  * Gradient boosting starts with a weak model.
  * It then improves the model by adding a "correction" to its predictions.
  * Simple algebra - for squared-error loss the "correction" is the difference between the true values and the predicted values (the residual)
  * So - train a new model to predict the "correction", add it in, and repeat
  * that's the general idea - see Wikipedia!
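
To make the "train a new model to predict the correction" idea concrete, here is a toy pure-Python sketch for squared-error regression with one-split "stumps" as the weak models. It is for intuition only and is not how LightGBM is implemented; the data, `fit_stump`, `fit_boosted`, and the learning rate are all illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals into two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # The "correction" to learn: true value minus current prediction
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with three rough plateaus
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

baseline = fit_boosted(xs, ys, n_rounds=0)   # just the mean
boosted = fit_boosted(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each round only nudges the prediction (scaled by the learning rate), which is why many small corrections are stacked up.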

# "14th place solution" (Belinda Trotta)
* https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75054#latest-448552
* ***well documented - a good starting point?***
* ***pdf with details is available***
* ***source is available on github***
  
* Summary:

  * LightGBM
  * run in 6 hours on a 24GB laptop
  * only elementary operations (no curve fitting or optimization - keeps runtime down)
* Feature/engineering
  * remove noise from flux (Bayes approach)
  * create scaled version of flux 
  * peaks - added features to capture the behaviour around the peak
  * "Understanding how to optimise the metric" ? 
  


# "Solution #5 tidbits (revised with code)" (CPMP)
* https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75050#latest-447982
* ***good feature engineering***
* ***pdf and code on github***
* Summary:
  * lightgbm (almost exclusively)
  
* Feature Engineering Approach
  * hand crafted from light curves (used Pandas)
  * did NOT use packages (light gatspy, cesium, tsfresh) because they were "way slower" and "not as good"
* Features
  * Std, skewness, kurtosis. **Goal = differentiate curves w/peaks (supernova) from others. Also captures shape of peak**
  * Ratios e.g. max passband divide by largest. **Gives a view of the spectra.**
  * Features based on difference between successive measurements. **Isolates rapidly varying sources from others**
  * Magnitude of Flux (need to compute this to start with)
  * MJD diff (days) feature - shared by Sionkowski in forum
  * Normalized flux
  * Curve fitting (?)
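
A rough sketch of two of the hand-crafted feature families above (shape statistics and successive differences), in plain Python rather than the Pandas code CPMP used. The light curves are invented, and `moments` and `change_rates` are hypothetical helper names.

```python
import math


def moments(values):
    """Return (std, skewness, excess kurtosis) from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / std ** 3
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return std, skewness, excess_kurtosis


def change_rates(mjd, flux):
    """Flux change per day between consecutive observations."""
    return [
        (flux[i + 1] - flux[i]) / (mjd[i + 1] - mjd[i])
        for i in range(len(flux) - 1)
    ]


# Hypothetical single-passband light curves (time in days, flux in arbitrary units)
mjd = [0, 2, 5, 7, 10, 14, 20, 30]
peaked = [1, 2, 40, 90, 60, 25, 8, 2]    # supernova-like: one big bump
flat = [10, 12, 9, 11, 10, 12, 9, 11]    # low-amplitude wiggle

peaked_std, peaked_skew, peaked_kurt = moments(peaked)
flat_std, flat_skew, flat_kurt = moments(flat)
peaked_max_rate = max(abs(r) for r in change_rates(mjd, peaked))
flat_max_rate = max(abs(r) for r in change_rates(mjd, flat))
print(peaked_std, peaked_skew, peaked_max_rate)
print(flat_std, flat_skew, flat_max_rate)
```

The peaked curve shows a larger spread, a positive skew, and a much larger maximum rate of change, which is the kind of separation these features are after.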
  
* Objective Function
  * ***this is important and I don't understand it*** he is looking at the competition performance metric
* Training-time Augmentation
* Stacking ("an immediate benefit of teaming")
  


# What is "LightGBM" ?
https://lightgbm.readthedocs.io/en/latest/


# #13 Solution, true story: tries and fails (Blonde)
* https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75134#latest-445370
* ***first-hand sense of fun - first kaggle?!***
* Summary:
  * LGBM
  * feature design and selection
  * augmentation
  * "Not much ensembling, 50/50 blend of two LGBM models"
* Features:
  * Magnitude (aggregated and per-band)
  * Parametric curve fittings (Bazin - reference)
  * extended Bazin
  * cesium (basic ratios, std, skew)
  * checked all features from "feets" (e.g. CAR tau from feets).  Optimized this 10x https://feets.readthedocs.io/en/latest/tutorial.html
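
For reference, a sketch of the parametric supernova light-curve shape usually attributed to Bazin et al. The exact parameterization used in the write-up is not stated in these notes, so the form below (exponential decline divided by a sigmoid-like rise, plus a baseline) and all parameter values are assumptions. Only evaluation is shown, not the fitting step.

```python
import math


def bazin(t, amplitude, baseline, t0, tau_rise, tau_fall):
    """Assumed Bazin-style shape:
    amplitude * exp(-(t - t0) / tau_fall) / (1 + exp(-(t - t0) / tau_rise)) + baseline
    Evaluated in log space so very early or late times do not overflow.
    Assumes tau_rise < tau_fall (fast rise, slow decline).
    """
    log_decline = -(t - t0) / tau_fall
    z = -(t - t0) / tau_rise
    # Numerically stable log(1 + exp(z))
    if z > 0:
        log_rise = z + math.log1p(math.exp(-z))
    else:
        log_rise = math.log1p(math.exp(z))
    return amplitude * math.exp(log_decline - log_rise) + baseline


# Hypothetical parameters (days and arbitrary flux units)
params = dict(amplitude=100.0, baseline=5.0, t0=50.0, tau_rise=3.0, tau_fall=20.0)

days = list(range(0, 200))
curve = [bazin(day, **params) for day in days]
peak_day = days[curve.index(max(curve))]

# For this form, the maximum sits at t0 + tau_rise * ln(tau_fall / tau_rise - 1)
expected_peak = params["t0"] + params["tau_rise"] * math.log(
    params["tau_fall"] / params["tau_rise"] - 1.0
)
print(peak_day, expected_peak)
```

The fitted parameters (amplitude, rise time, fall time) would then serve as features for the classifier.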

* "Battling over-fitting"
  * augmented training (flux)
  
  
  

# 4th Place Solution with Github Repo (Ahmet Erdem)
https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75011#latest-444878
* ***code on github***
* Summary:
  * a blend of LGB, NN and several stacking models
* Features that worked:
  * Ratios (passband / all passbands)
  * Stacking e.g. Simple Logistic Regression on LGB  <-- maybe a good exercise?
  * sub-model (trained model to predict hostgal_specz, then used model prediction as feature)
  * normal and log-transformed (both) on Neural Net (allows +,-,*,/)
  * Bazin (light-curve fitting)
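
One way to read the "Ratios (passband / all passbands)" feature is each passband's share of the summed peak flux. That reading, the band names, and the flux values are assumptions for illustration; `passband_ratios` is a hypothetical helper.

```python
def passband_ratios(flux_by_band):
    """Share of the total peak flux contributed by each passband.

    Assumes the peak flux in every band is positive; real PLAsTiCC flux
    values can be negative and would need handling first.
    """
    peaks = {band: max(values) for band, values in flux_by_band.items()}
    total = sum(peaks.values())
    return {band: peak / total for band, peak in peaks.items()}


# Hypothetical flux measurements per passband for one object
flux_by_band = {
    "u": [1.0, 3.0, 2.0],
    "g": [4.0, 12.0, 6.0],
    "r": [5.0, 20.0, 9.0],
    "i": [4.0, 15.0, 8.0],
}

ratios = passband_ratios(flux_by_band)
print(ratios)
```

Because the ratios are normalized, they describe the colour of the object independently of how bright it is overall.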
  

 

# 12th Place Solution (Daniel Bi)
* Summary:
  * ensemble of LightGBM, XGBoost, and binary classifiers.
* Feature Engineering
  * light curves - 50-60 features. Did both passband-level and object-level features.  Did NOT use libraries
  * For frequency extraction - used kernel provided by third-party, grouped into 20 bins
  * Bazin (light curve fitting)
  * flux adjustments (a variety) and 
* Models:
  * three models: galactic, extragalactic with specz, and extragalactic without specz
  * "Passband Level Model" = LightGBM on light curve only
  * "Object Level Model" = LightGBM with 100 features (galactic), 140 features (extragalactic)
* Ensemble
  * LightGBM, XGBoost and binary classification predictions
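
A small sketch of how objects might be routed to the three models (galactic, extragalactic with specz, extragalactic without specz). It assumes, as I understand the data, that galactic objects have `hostgal_photoz` equal to 0 and that a missing `hostgal_specz` shows up as NaN or None; `choose_model` and the returned labels are hypothetical.

```python
import math


def choose_model(hostgal_photoz, hostgal_specz):
    """Pick which of the three models should score an object."""
    # Assumption: a photometric redshift of exactly 0 marks a galactic object
    if hostgal_photoz == 0:
        return "galactic"
    # Assumption: a missing spectroscopic redshift is None or NaN
    if hostgal_specz is None or math.isnan(hostgal_specz):
        return "extragalactic_without_specz"
    return "extragalactic_with_specz"


# Hypothetical metadata rows: (hostgal_photoz, hostgal_specz)
rows = [(0.0, 0.0), (0.45, 0.43), (0.80, float("nan"))]
routes = [choose_model(photoz, specz) for photoz, specz in rows]
print(routes)
```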
  
 
  

# Overview of 1st place solution (Kyle Boone)
https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75033#latest-457546
* an astrophysicist
* code available on https://github.com/kboone/avocado
* Summary   
  * most of the work went into telling the supernovae apart ("everything else was fairly easy to tell apart")
  * ***-> in other words - use time wisely - ML will solve part of the prediction easily - focus where ML needs help***
  * augmented training set by "degrading the well-observed lightcurves in the training set to match the properties of the test set".
  * ***-> in other words - the training set only had a few examples of well observed light curves so it is necessary to augment these to prevent over-fit***
  * use gaussian process to predict light curves
  * ***-> what does this mean? - review code***
  * measured 200 features on raw data and Gaussian process predictions
  * trained single LGBM model with 5-fold cross-validation
* ***-> There is a WEBSITE on "Avocado" https://avocado-classifier.readthedocs.io/en/latest/***
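
To picture what "degrading the well-observed lightcurves" could look like, here is a toy sketch that thins out the observations and inflates the noise. This is not the avocado procedure (see the repo for that); the `degrade` helper, its parameters, and the sample curve are assumptions.

```python
import math
import random


def degrade(light_curve, keep_fraction=0.5, noise_scale=2.0, seed=None):
    """Return a sparser, noisier copy of a light curve.

    light_curve: list of (mjd, flux, flux_err) tuples.
    noise_scale must be >= 1: reported errors are multiplied by it, and
    matching Gaussian noise is added to the flux.
    """
    rng = random.Random(seed)
    n_keep = max(2, round(len(light_curve) * keep_fraction))
    kept = sorted(rng.sample(light_curve, n_keep))  # sorts by mjd first
    degraded = []
    for mjd, flux, flux_err in kept:
        new_err = flux_err * noise_scale
        # Extra noise chosen so the total variance matches new_err ** 2
        extra_sigma = math.sqrt(new_err ** 2 - flux_err ** 2)
        degraded.append((mjd, flux + rng.gauss(0.0, extra_sigma), new_err))
    return degraded


# Hypothetical well-observed light curve: 10 points, constant flux, small errors
curve = [(float(day), 100.0, 1.0) for day in range(0, 50, 5)]
degraded = degrade(curve, keep_fraction=0.5, noise_scale=2.0, seed=0)
print(degraded)
```

Running this several times with different seeds turns one well-observed object into many poorly-observed training examples.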
  

# ------------ next steps - review code from examples! -----------


# ------------- end ------------------------

# What follows are raw notes from my review of the paper on Performance Metrics.  Very interesting but too much detail.

## "Info for the Challenge"
https://www.kaggle.com/c/PLAsTiCC-2018/discussion/67376#latest-439538
* Paper on the Performance Metric **good read** https://arxiv.org/abs/1809.11145
 * Introduction:
   * The "Performance Metric" is chosen to prefer Probabilistic Classification vs Deterministic Classification
     * Probabilistic Classification delivers a probability distribution over class values.  Deterministic Classification delivers a single class value, e.g. the MAP (maximum a posteriori) class.
     * The reasons are:
       * Resources for analysis are limited - model users need model that assists in making tradeoff decisions
       * Data can be incomplete - a probabilistic model allows inference on incomplete data
       * Recent research on hierarchical inference - this is a hierarchical model
     * ***I believe they are missing the intent which is to separate representation (the model) from inference (the querying). Reference Koller.***
   * In the summary they conclude
     * "we sought a metric (...) that avoids deterministic labels (preferring) one with strong performance across all classes"
     * Brier score vs log-loss.   Sum-of-square differences between probabilities vs Log-loss
     * They also state "Both metrics are susceptible to rewarding a classifier that performs well on the most prevalent class and poorly on all others" 
     * In summary "we select a per-class weighted log-loss as the optimal choice for PLAsTiCC"
   * Note that log-loss is Entropy where p() is the conditional probability p(m|d) where m = light_curve has true class m and d is the data
   * They also use cross-entropy which they describe as increase in disorder
   * **several references here** including the Kullback-Leibler Divergence
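
A quick side-by-side of the two candidate metrics for a single object, as a sketch with made-up probability vectors. The function names are mine, and the Brier score here is the multi-class sum-of-squares form.

```python
import math


def brier_score(probs, true_index):
    """Sum of squared differences between predicted probabilities and the one-hot truth."""
    return sum(
        (p - (1.0 if k == true_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


def log_loss(probs, true_index, eps=1e-15):
    """Negative log of the probability assigned to the true class."""
    return -math.log(max(probs[true_index], eps))


# Hypothetical predictions over three classes; the true class is index 0
confident_right = [0.90, 0.05, 0.05]
hedged = [0.40, 0.30, 0.30]
confident_wrong = [0.01, 0.98, 0.01]

for name, probs in [("confident_right", confident_right),
                    ("hedged", hedged),
                    ("confident_wrong", confident_wrong)]:
    print(name, brier_score(probs, 0), log_loss(probs, 0))
```

The Brier score can never exceed 2 for one object, while log-loss grows without bound as the probability on the true class goes to zero, so log-loss punishes confident mistakes much harder.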


### From Wikipedia:
![Cross_Entropy,bits_to_transmit_dist_1_using_dist_2](images/pic01_cross_entropy_def.png)


!["foo","bar"](images/pic02_cross_entropy_def2.png)
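
The relationship between entropy, cross-entropy, and the Kullback-Leibler divergence can be checked numerically. This sketch uses base-2 logs (bits) and two made-up distributions `p` and `q`.

```python
import math


def entropy(p):
    """Average bits needed to encode draws from p with a code built for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Average bits needed to encode draws from p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra bits paid for using q's code instead of p's."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.50, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.25, 0.50]   # model distribution

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

Cross-entropy equals entropy plus the KL divergence, so minimizing cross-entropy against fixed truth is the same as minimizing the KL divergence.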


### From the article:
!["foo","bar"](images/pic03_metric_definitions_and_indicator.png)

### notes..
note they have not yet defined the metric Qn ...

### Defining cross-entropy as the metric
!["foo","bar"](images/pic04_metric_value_definition_using_log_loss_cross_entropy.png)
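
A minimal sketch of a per-class weighted log-loss, written from my reading of the paper's summary: average the log-probability of the true class within each class, then take a weighted average across classes. The class names, weights, and the choice to skip classes with no objects are assumptions; the official competition weights are not reproduced here.

```python
import math


def weighted_log_loss(true_labels, predictions, class_weights, eps=1e-15):
    """Per-class weighted log-loss.

    true_labels:   list of class names, one per object
    predictions:   list of dicts mapping class name -> predicted probability
    class_weights: dict mapping class name -> weight
    """
    log_prob_sum = {c: 0.0 for c in class_weights}
    count = {c: 0 for c in class_weights}
    for label, probs in zip(true_labels, predictions):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        log_prob_sum[label] += math.log(p)
        count[label] += 1

    numerator = 0.0
    denominator = 0.0
    for c, weight in class_weights.items():
        if count[c] == 0:
            continue  # assumption: classes with no objects are skipped
        numerator += weight * log_prob_sum[c] / count[c]
        denominator += weight
    return -numerator / denominator


# Hypothetical set: 9 objects of a common class predicted well, 1 rare object predicted badly
true_labels = ["common"] * 9 + ["rare"]
predictions = [{"common": 0.9, "rare": 0.1}] * 10

weighted = weighted_log_loss(true_labels, predictions, {"common": 1.0, "rare": 1.0})
plain = -sum(math.log(p[label]) for label, p in zip(true_labels, predictions)) / 10
print(weighted, plain)
```

With equal class weights the single rare object counts as much as the nine common ones, so a classifier cannot score well by getting only the prevalent class right.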