# kaggle_plasticc

This Jupyter Notebook explores the Kaggle PLAsTiCC dataset and competition. The competition and dataset can be found at https://www.kaggle.com/c/PLAsTiCC-2018 

This notebook can be found at  https://github.com/cwinsor/kaggle_plasticc

PLAsTiCC is the Kaggle competition for LSST (Large Synoptic Survey Telescope). The telescope, under construction on a remote mountaintop in Chile, will capture altogether new data from the night skies and analyze that data using altogether new techniques.

LSST performs **"time domain astronomy"**, which means it looks for short-term events called "transients". There are dozens of known classes of transients: single-occurrence events like novae and micro-lensing, and periodic events like eclipsing binaries. Micro-lensing is one of the ways astronomers discover planets orbiting distant stars - like earth around another sun!

LSST is vast in its ability to observe. LSST will capture the entire southern hemisphere every 4 nights searching for transients. **LSST researchers expect there are about 1M transient events every night.**

Finding those events in the vast quantities of data requires new techniques in classification. LSST is exploring new boundaries in automated classification. **The goal for LSST is to flag a one-time event (such as the start of a nova) within 60 minutes of observation** and notify other astronomers so they can start closer observation. That is the reason for PLAsTiCC: to establish tools for classification that can handle data on this scale.

That is Pretty Cool Stuff!


# The Plan
The steps in my investigation of PLAsTiCC are planned to be:

* Review the "Starter Kit": https://www.kaggle.com/michaelapers/the-plasticc-astronomy-starter-kit
* Follow the example: https://www.kaggle.com/michaelapers/the-plasticc-astronomy-classification-demo

The example starts by using the Cesium library to "featurize" the time-series data, so we take a dive into Cesium, a library for turning time-series data into features.

A good example: http://cesium-ml.org/docs/auto_examples/plot_EEG_Example.html#sphx-glr-auto-examples-plot-eeg-example-py
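
To make "featurize" concrete, here is a minimal sketch in plain NumPy. It does not use Cesium's API; the function name `featurize_light_curve` and the chosen features are my own assumptions, picked only to show the idea that series of different lengths are reduced to the same fixed set of numbers.

```python
import numpy as np

def featurize_light_curve(times, values):
    """Reduce one irregularly sampled series to a fixed-length feature dict.

    Hypothetical helper for illustration; not part of Cesium.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    slopes = np.diff(values) / np.diff(times)
    return {
        "n_obs": int(values.size),
        "duration": float(times.max() - times.min()),
        "mean": float(values.mean()),
        "std": float(values.std()),
        "amplitude": float((values.max() - values.min()) / 2.0),
        "max_abs_slope": float(np.max(np.abs(slopes))),
    }

# Two made-up series with different lengths and sampling times.
short = featurize_light_curve([0.0, 1.0, 3.0], [1.0, 3.0, 1.0])
long_ = featurize_light_curve([0.0, 2.0, 3.0, 7.0, 9.0], [5.0, 5.0, 6.0, 4.0, 5.0])

# Both produce the same feature names, so they can be stacked into one table.
print(sorted(short) == sorted(long_))
```

Once every light curve is a row of features like this, any standard classifier can be trained on the resulting table.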

The example also uses a library to establish the period of a repeating signal that is sampled infrequently.
(Interesting stuff, and very useful. Need to revisit this.)
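
One common technique for finding the period of an unevenly sampled signal is the Lomb-Scargle periodogram. I have not confirmed which method the example's library uses, so the sketch below is only an illustration of the general idea, written in plain NumPy on synthetic data. The function name `lomb_scargle` and all the numbers are assumptions.

```python
import numpy as np

def lomb_scargle(times, values, frequencies):
    """Classical Lomb-Scargle power at each trial frequency (cycles per unit time)."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(values, dtype=float)
    y = y - y.mean()
    power = np.empty(len(frequencies))
    for i, f in enumerate(frequencies):
        w = 2.0 * np.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)), np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

# Synthetic signal: a sinusoid observed at 80 random times over 100 days.
rng = np.random.default_rng(0)
true_period = 2.5
times = np.sort(rng.uniform(0.0, 100.0, size=80))
values = np.sin(2 * np.pi * times / true_period) + 0.1 * rng.normal(size=times.size)

frequencies = np.linspace(0.01, 1.0, 5000)
power = lomb_scargle(times, values, frequencies)
best_period = 1.0 / frequencies[np.argmax(power)]
print(f"true period {true_period}, recovered period {best_period:.3f}")
```

The frequency with the highest power is taken as the best estimate of the period, even though no two observations are evenly spaced.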
    
The example then reduces the number of features and evaluates using correlation, PCA, and a confusion matrix.
(More work needed here.)
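
As a reminder of what those three tools do, here is a small NumPy sketch on made-up data. It is not the code from the example notebook; the helper names `pca_project` and `confusion_matrix` are hypothetical, and the feature table is random.

```python
import numpy as np

def pca_project(features, n_components):
    """Project rows onto the top principal components (PCA via SVD)."""
    centered = features - features.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

# Made-up feature table: column 1 is nearly a copy of column 0 (redundant).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
features = np.hstack([
    base,
    2.0 * base + 0.01 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Correlation flags the redundant pair; PCA folds it into one component.
correlation = np.corrcoef(features, rowvar=False)
projected, explained = pca_project(features, n_components=2)

# Confusion matrix for a handful of made-up predictions.
cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
print(cm)
```

Off-diagonal entries of the confusion matrix show which classes get mistaken for which, which is more informative than a single accuracy number.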


The Dataset:
* The dataset has two parts
* "metadata": redshift and color frequency bands for each item
* "luminescence": the brightness for each item

To simplify things:
* Start with just "luminescence" - no metadata
* Start with just "galactic" data (our galaxy)
* Together these will simplify things a lot
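
A rough sketch of the "galactic only" selection with pandas follows. The tables are tiny made-up stand-ins for the competition files. The column names (`object_id`, `hostgal_photoz`, `mjd`, `passband`, `flux`) and the rule that a host-galaxy redshift of zero marks a galactic object are my understanding of the competition data and should be treated as assumptions. Note that the redshift column is borrowed from the metadata only to make the selection.

```python
import pandas as pd

# Made-up stand-in for the metadata table (one row per object).
metadata = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "hostgal_photoz": [0.0, 0.42, 0.0, 1.10],
})

# Made-up stand-in for the brightness table (one row per observation).
light_curves = pd.DataFrame({
    "object_id": [1, 1, 2, 3, 3, 4],
    "mjd": [59580.0, 59583.1, 59580.5, 59581.0, 59590.2, 59582.7],
    "passband": [2, 3, 2, 1, 2, 4],
    "flux": [120.5, 98.2, 15.1, -3.4, 40.0, 7.7],
})

# Assumption: redshift exactly zero means the object is in our galaxy.
galactic_ids = metadata.loc[metadata["hostgal_photoz"] == 0.0, "object_id"]

# Keep only the observations that belong to galactic objects.
galactic_curves = light_curves[light_curves["object_id"].isin(galactic_ids)]
print(galactic_curves)
```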
    
My plan is:
1) Wrangle and get familiar with the data - a bit
* For each of the (N) categorical data types - investigate ONE instance.
* This is a single star in our galaxy which we know is an "XYZ".
* Investigate the properties and/or behavior of that star.
* Confirm we know a little about each category and what their observable properties are.

2) Investigate approaches taken by other Kagglers:
* This may take some time: review the "top 100" models and categorize them by what they were predicting and the approach taken.
* Create a table.

3) Investigate approaches taken by other Astronomers ....
* 


## Reviewing the Discussion Board
https://www.kaggle.com/c/PLAsTiCC-2018/discussion


* Paper on the Performance Metric **good read** https://arxiv.org/abs/1809.11145
 * Introduction:
   * The "Performance Metric" is chosen to prefer probabilistic classification over deterministic classification
     * Probabilistic classification delivers a probability distribution over class values. Deterministic classification delivers a single class value, such as the MAP (maximum a posteriori) estimate.
     * The reasons are:
       * Resources for analysis are limited - model users need a model that assists in making tradeoff decisions
       * Data can be incomplete - a probabilistic model allows inference on incomplete data
       * Recent research on hierarchical inference - this is a hierarchical model
     * ***I believe they are missing the intent which is to separate representation (the model) from inference (the querying). Reference Koller.***
   * In the summary they conclude
     * "we sought a metric (...) that avoids deterministic labels (preferring) one with strong performance across all classes"
     * Brier score vs log-loss: the sum of squared differences between predicted and true probabilities vs the log of the probability assigned to the true class
     * They also state "Both metrics are susceptible to rewarding a classifier that performs well on the most prevalent class and poorly on all others" 
     * In summary "we select a per-class weighted log-loss as the optimal choice for PLAsTiCC"
   * Note that log-loss is entropy where p() is the conditional probability p(m|d), with m meaning the light curve has true class m and d being the data
   * They also use cross-entropy, which they describe as an increase in disorder
   * **several references here** including the Kullback-Leibler divergence
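
To see why the per-class weighting matters, here is a sketch of both metrics in NumPy. This is my reading of a per-class weighted log-loss, not code from the paper. The function names, the clipping constant `eps`, the normalization by the sum of weights, and the equal weights are all assumptions.

```python
import numpy as np

def weighted_log_loss(truth, probs, weights, eps=1e-15):
    """Average the log-loss within each class, then combine classes by weight.

    truth:   (N, M) one-hot matrix of true classes
    probs:   (N, M) predicted probabilities, each row summing to 1
    weights: (M,) per-class weights (assumed values, chosen by the user)
    """
    truth = np.asarray(truth, dtype=float)
    weights = np.asarray(weights, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=1, keepdims=True)
    n_per_class = truth.sum(axis=0)
    present = n_per_class > 0  # skip classes with no objects
    per_class = -(truth * np.log(probs)).sum(axis=0)[present] / n_per_class[present]
    return float(np.sum(weights[present] * per_class) / np.sum(weights[present]))

def brier_score(truth, probs):
    """Mean over objects of the summed squared probability errors."""
    truth = np.asarray(truth, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(np.sum((probs - truth) ** 2, axis=1)))

# Made-up, imbalanced sample: four objects of class 0, one each of classes 1 and 2.
labels = np.array([0, 0, 0, 0, 1, 2])
truth = np.eye(3)[labels]

# A classifier that always bets on the most prevalent class.
always_common = np.tile([0.8, 0.1, 0.1], (6, 1))

plain = float(-np.mean(np.log(always_common[np.arange(6), labels])))
wll = weighted_log_loss(truth, always_common, weights=np.ones(3))
brier = brier_score(truth, always_common)
print(f"plain log-loss {plain:.4f}, per-class weighted {wll:.4f}, Brier {brier:.4f}")
```

With equal class weights the per-class version penalizes this classifier more heavily than the plain log-loss does, because the two rare classes count as much as the prevalent one.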


### From Wikipedia:
![Cross_Entropy,bits_to_transmit_dist_1_using_dist_2](images/pic01_cross_entropy_def.png)


!["foo","bar"](images/pic02_cross_entropy_def2.png)
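
The "bits to transmit" reading of cross-entropy can be checked with a small worked example. The distributions `p` and `q` below are made up, and the helper names are my own. The identity being illustrated is that cross-entropy equals entropy plus the Kullback-Leibler divergence.

```python
import numpy as np

def entropy_bits(p):
    """Average bits per symbol with a code built for the true distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    """Average bits per symbol when data follows p but the code was built for q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_divergence_bits(p, q):
    """Extra bits per symbol paid for using q in place of p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p = [0.5, 0.25, 0.25]   # true distribution (made up)
q = [0.25, 0.25, 0.5]   # assumed distribution (made up)

h_p = entropy_bits(p)             # 1.5 bits
h_pq = cross_entropy_bits(p, q)   # 1.75 bits
kl = kl_divergence_bits(p, q)     # 0.25 bits
print(h_p, h_pq, kl)
```

Log-loss is the same quantity measured with natural logarithms, where the true distribution is one-hot on the correct class.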


### From the article:
!["foo","bar"](images/pic03_metric_definitions_and_indicator.png)

### notes..
note they have not yet defined the metric Qn ...

### Defining cross-entropy as the metric
!["foo","bar"](images/pic04_metric_value_definition_using_log_loss_cross_entropy.png)