# 2018-01-12 / FMA sub-sampling

* Problem statement:
  * Input:
    * `C` csv files
    * each file has `n` rows. Each row in file `c` encodes the prediction for class `c` on a 1sec segment.
    * A target number `k`
    * Target fractions for class representations `p[c]`.
    
  * Output:
    * A set of `k` clips, each 10 seconds in duration
    * Aggregate predicted likelihoods for each class `c` on each of the `k` clips
    * Summed over the selected clips, each class `c` has aggregate likelihood at least `p[c] * k`


* Method:
  1. drop edge effects from the beginning and end of tracks: remove the first and last frames from each track.
  2. window the frame observations into 10sec clips with aggregate labels
  3. threshold the aggregate likelihoods to binarize the representation
  4. subsample the 10sec clips using entrofy
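
A rough sketch of steps 1-3 for a single track, assuming non-overlapping 10-frame windows, leftover frames discarded, mean aggregation, and a 0.5 threshold. All of these are placeholder choices (see the questions below), and `window_and_binarize` is a hypothetical helper name:

```python
import numpy as np

def window_and_binarize(frames, clip_len=10, agg=np.mean, threshold=0.5):
    """frames: (n_frames, n_classes) array of per-second likelihoods for one track."""
    n_classes = frames.shape[1]
    # Step 1: drop the first and last frame to avoid edge effects
    trimmed = frames[1:-1]
    # Step 2: cut into non-overlapping clips of clip_len frames (remainder discarded)
    n_clips = len(trimmed) // clip_len
    clips = trimmed[: n_clips * clip_len].reshape(n_clips, clip_len, n_classes)
    likelihoods = agg(clips, axis=1)
    # Step 3: binarize the aggregate likelihoods
    labels = likelihoods >= threshold
    return likelihoods, labels

# Toy input: 32 one-second frames, 3 classes
rng = np.random.default_rng(0)
frames = rng.random((32, 3))
likelihoods, labels = window_and_binarize(frames)
```

Step 4 (entrofy) would then run on the binarized table stacked over all tracks.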


* Questions:
  * How should likelihoods be aggregated within a segment?
    * Mean?  Max?  Quartile?
    * Mean makes sense from the perspective of random frame sampling
    * Quartile makes sense wrt sparse events
    * Max makes sense wrt extremely sparse events
  * How should likelihoods be thresholded?  0.5?  Empirical average over X?
    * $p[y] = \sum_x p[y|x] \, p[x] \approx \frac{1}{|X|} \sum_{x \in X} p[y|x]$
    * But that doesn't really matter.  The threshold should be Bayes-optimal (=> 0.5)
  * What's the target number of positives per class `k * p[c]`? 
    * Maybe that should be determined by the base rate estimation `p[y]`?
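
To make the aggregation trade-off concrete, here is a toy comparison on two made-up 10-frame clips for a single class. "Quartile" is read here as the upper quartile (75th percentile), which is an assumption. The last lines illustrate the empirical base-rate estimate `p[y]` as the average of `p[y|x]`:

```python
import numpy as np

def aggregate(clip):
    """Return mean, upper-quartile, and max aggregates for one class over one clip."""
    return {
        "mean": float(np.mean(clip)),
        "quartile": float(np.quantile(clip, 0.75)),  # assumed: upper quartile
        "max": float(np.max(clip)),
    }

# Sparse event: active in 3 of 10 frames
sparse = np.array([0.05] * 7 + [0.9] * 3)
# Extremely sparse event: active in 1 of 10 frames
very_sparse = np.array([0.05] * 9 + [0.9])

agg_sparse = aggregate(sparse)
agg_very_sparse = aggregate(very_sparse)

# Empirical base rate: p[y] ~= average of p[y|x] over the observed clips
clip_likelihoods = np.array([agg_sparse["max"], agg_very_sparse["max"], 0.1, 0.2])
base_rate = clip_likelihoods.mean()
```

With a 0.5 threshold, the mean misses both events, the upper quartile catches only the 3-frame event, and the max catches both.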
  
  
* Next step: Question scheduling on CF.
  * Idea: cluster the tracks according to aggregated likelihood vectors
    * Or maybe by their thresholded likelihoods?
  * Set the number of clusters to be relatively large (say, 23^2 = 529, i.e. roughly 512)
  * When generating questions for an annotator, assign them to a cluster and only generate questions from that cluster
  * Reasoning: this will keep the labels consistent from one question to the next
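
A minimal sketch of the clustering idea, using plain Lloyd's k-means written in numpy as a stand-in (no clustering algorithm has been chosen yet). `kmeans` and `questions_for` are hypothetical helpers, and the data is random:

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the rows of X (aggregated likelihood vectors)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from distinct random rows
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every row to every center, shape (n, k)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(n_clusters):
            members = X[assign == j]
            if len(members):  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers

def questions_for(cluster, assign, n_questions, seed=0):
    """Draw question indices for an annotator from a single cluster only."""
    rng = np.random.default_rng(seed)
    pool = np.flatnonzero(assign == cluster)
    return rng.choice(pool, size=min(n_questions, len(pool)), replace=False)

# Toy data: 200 tracks, 5 classes, 8 clusters (the notes suggest ~512 in practice)
X = np.random.default_rng(1).random((200, 5))
assign, centers = kmeans(X, n_clusters=8)
questions = questions_for(assign[0], assign, n_questions=5)
```

The dense distance matrix is fine for a toy example but would need batching for a large track set with hundreds of clusters.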
  
  
* UPDATE:
  * Windowing and aggregation are happening upstream of this
  * Aggregation is max over the middle 8 frames
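
For reference, the upstream aggregation as understood from the update, assuming each clip arrives as 10 one-second frames so that "middle 8" means dropping one frame on each side (`aggregate_clip` is a hypothetical name, not the upstream code):

```python
import numpy as np

def aggregate_clip(clip_frames):
    """clip_frames: (10, n_classes) likelihoods; returns the max over frames 1..8."""
    if clip_frames.shape[0] != 10:
        raise ValueError("expected a 10-frame clip")
    return clip_frames[1:9].max(axis=0)

# Toy clip with 2 classes: a spike on the edge frame is ignored, a spike in the middle is kept
clip = np.full((10, 2), 0.1)
clip[0, 0] = 0.99   # edge frame, dropped
clip[4, 1] = 0.8    # middle frame, kept
aggregated = aggregate_clip(clip)
```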
  

# 2018-01-19

* Eric has provided the per-fragment aggregated estimates as one giant table
* So what are our entrofy parameters?
  * attribute thresholds
      * Do we only split at 0.5 (below vs. above)?
      * Or break likelihood into quartiles?
      * **Sounds like quartiles are the way to go**
  * target proportions per class?
      * we can try to preserve the empirical distribution
      * or a biased distribution achieved by grouping on the track ids?
      * or uniform?
      * **Uniform across quartiles for each instrument**
  * output set size?
      * 20-50 positives per instrument?
      * say, `16 * 4 * n_classes`
      * Maybe round up to 1K to start
  
* If we only want one example per track, we can make an aux categorical column that's the track index, and set the target number to 1
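
A small sketch of the bookkeeping behind these choices: binning likelihoods into the four intervals, the `16 * 4 * n_classes` size estimate, and the auxiliary track column. Column names, track ids, and the class count in the example are made up, and how the track column's target is passed to entrofy is left open:

```python
import numpy as np
import pandas as pd

# Interior edges of the four likelihood bins [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1]
INNER_EDGES = [0.25, 0.5, 0.75]

def likelihood_bins(df, class_columns):
    """Map each class likelihood to a bin index 0..3."""
    return pd.DataFrame(
        {c: np.digitize(df[c].to_numpy(), INNER_EDGES) for c in class_columns},
        index=df.index,
    )

def output_set_size(n_classes, per_bin=16, n_bins=4):
    """Size estimate from the notes: per_bin examples for each bin of each class."""
    return per_bin * n_bins * n_classes

def add_track_column(df, track_ids):
    """Attach the track index as an auxiliary categorical column."""
    out = df.copy()
    out["track"] = pd.Categorical(track_ids)
    return out

# Toy table: 4 clips, 2 classes (hypothetical names)
df = pd.DataFrame({"guitar": [0.10, 0.30, 0.60, 0.90],
                   "flute": [0.80, 0.05, 0.45, 0.55]})
bins = likelihood_bins(df, ["guitar", "flute"])
with_track = add_track_column(df, ["t1", "t1", "t2", "t3"])
size = output_set_size(n_classes=20)  # 20 is a placeholder class count
```

Uniform targets then correspond to 0.25 per bin for each instrument column.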

# 2018-02-02

* Turns out we didn't get the data transferred in time on 01/19, so still waiting
* output set size: 500-1000 positives per class
* try both hard threshold and quartile sampling

In [1]:
import numpy as np
import pandas as pd
import entrofy

In [4]:
pd.read_csv?

In [None]:
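# `df` is assumed to hold the per-fragment likelihood table (one column per class);
# it is not loaded in the cells shown here.
# Bin each column into four equal-width likelihood intervals.
mappers = {col: entrofy.mappers.ContinuousMapper(df[col], n_out=4,
                                                 boundaries=[0.0, 0.25, 0.5, 0.75, 1.0]) for col in df}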

In [3]:
# Select 1000 rows, keeping the best result over 100 trials
idx, score = entrofy.entrofy(df, 1000, mappers=mappers, n_trials=100)