In [None]:
%Date: 04.01.2016

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math

The Approximate Median Significance (AMS) is defined as:      
$$AMS = \sqrt{2 \left( (s + b + b_{reg}) \log\left[1 + \frac{s}{b + b_{reg}}\right] - s \right)}$$  

where:    
- $b_{reg} = 10$ is a regularization term (set by the contest),
- $b = \sum_{i:\, y_i=0} w_i$ is the sum of the weights of background events (incorrectly classified as signal),
- $s = \sum_{i:\, y_i=1} w_i$ is the sum of the weights of signal events (correctly classified as signal),
- $\log$ is the natural logarithm
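
As a quick numeric check of the definition, the following minimal sketch evaluates the formula for two cases. The function name `ams` and the parameter `b_reg` are illustrative and not part of the notebook. The first call uses the weight sums that this notebook reports for its toy data; the second uses made-up values to illustrate that for $s \ll b$ the AMS is close to $s / \sqrt{b + b_{reg}}$.

```python
import math

def ams(s, b, b_reg=10.0):
    """AMS for a weighted signal sum s and a weighted background sum b."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Weight sums printed by this notebook for the perfect classifier
max_ams = ams(124.326027277, 0.0)
print(max_ams)  # approximately 21.19

# Made-up small-signal case: AMS is close to s / sqrt(b + b_reg)
s, b = 5.0, 2500.0
small_signal_ams = ams(s, b)
approximation = s / math.sqrt(b + 10.0)
print(small_signal_ams, approximation)
```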

In [2]:
def calcAMS(s,b):
    br = 10.0  # regularization term b_reg
    radicand = 2 * ((s+b+br) * math.log(1.0 + s/(b+br)) - s)
    if radicand < 0:
        # raise instead of calling exit(), which would shut down the kernel
        raise ValueError('radicand is negative')
    ams = math.sqrt(radicand)
    print("AMS:", ams)
    return ams

Following this definition, the maximum AMS is reached by a perfect classifier: $s$ is then simply the sum of the weights of all positive labels, and $b = 0$.

In [3]:
def calcWeightSums(weights,preds,labels):
    s = 0
    b = 0
    # only events predicted as signal (pred > 0) contribute to the sums
    for weight, pred, label in zip(weights, preds, labels):
        if pred > 0.:
            if label > 0.:
                s += weight  # true signal classified as signal
            else:
                b += weight  # background classified as signal
    return s,b
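
For large arrays, the loop above can also be expressed with boolean masks. The following is a minimal sketch of that idea, assuming NumPy arrays of equal length; the function name `weight_sums_vectorized` and the sample values are hypothetical and do not appear in the notebook.

```python
import numpy as np

def weight_sums_vectorized(weights, preds, labels):
    """Sum the weights of events predicted as signal, split by true label."""
    weights = np.asarray(weights, dtype=float)
    selected = np.asarray(preds) > 0      # predicted as signal
    is_signal = np.asarray(labels) > 0    # truly signal
    s = weights[selected & is_signal].sum()    # true positives
    b = weights[selected & ~is_signal].sum()   # false positives
    return s, b

# Hypothetical sample: four events, three of them predicted as signal
weights = np.array([0.5, 0.2, 0.9, 0.1])
preds = np.array([1, 1, 0, 1])
labels = np.array([1, 0, 1, 1])
s, b = weight_sums_vectorized(weights, preds, labels)
print(s, b)
```

The third event is a signal that was not selected, so it contributes to neither sum.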

The data has the form $[w, y, x_1, \dots, x_k]$ where

- $w$ is a weight in the interval $[0,1)$
- $y$ is the label: "0" for "background" or "1" for "signal"
- $x_1, \dots, x_k$ are randomly generated features whose distribution depends on the label
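
To make the layout concrete, here is a small sketch with three hand-written rows and two feature columns. All numbers and the variable name `toy` are made up for illustration; only the column order follows the description above.

```python
import numpy as np

# Columns: weight w, label y, features x_1 and x_2 (hypothetical values)
toy = np.array([
    [0.03, 1.0,  4.2, 11.7],   # signal event
    [0.61, 0.0, 21.5, 38.9],   # background event
    [0.87, 0.0, 18.3, 42.0],   # background event
])

weights = toy[:, 0]    # first column
labels = toy[:, 1]     # second column
features = toy[:, 2:]  # all remaining columns

print(weights.shape, labels.shape, features.shape)
```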

In [4]:
def calcMaxAMS(weights,labels):
    s,b = calcWeightSums(weights,labels,labels)
    ams = calcAMS(s,b)
    print("Found", int(labels.sum()), "signals.")
    print("Weightsums signal:", s, "| background:", b)
    print("Maximum AMS possible with this Data:", ams)
    return ams

We compute AMS values on well-separable toy data, starting with the maximum AMS. 
The data of the actual challenge is weighted to punish wrongly identified signals significantly harder than wrongly identified background. Our toy data mimics this by reusing the random draw that determines the label as the weight, so signal events receive weights of at most `s_prob`. The features are drawn from normal distributions.
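
The effect of this weighting scheme can be sketched in a few lines. The snippet below assumes the same rule that `createToyData` uses (an event is a signal when its uniform draw is at most `s_prob`), but it uses a seeded `np.random.default_rng` generator, which the notebook itself does not use. Under that rule the expected signal weight sum is $n \cdot s_{prob}^2 / 2$, which is 125 for $n = 100000$ and $s_{prob} = 0.05$.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator for reproducibility
n, s_prob = 100_000, 0.05

draws = rng.random(n)        # one uniform draw per event, reused as its weight
is_signal = draws <= s_prob  # same label rule as in createToyData

signal_weight_sum = draws[is_signal].sum()
background_weight_sum = draws[~is_signal].sum()
expected_signal_sum = n * s_prob**2 / 2  # n * E[w * 1(w <= s_prob)]

print("largest signal weight:", draws[is_signal].max())
print("signal weight sum:", signal_weight_sum, "expected:", expected_signal_sum)
print("background weight sum:", background_weight_sum)
```

Because every signal weight is at most `s_prob`, a single background event wrongly selected as signal typically adds far more to $b$ than a correctly selected signal adds to $s$.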


In [5]:
def generateFeature(label, mu_s, mu_b, sigma_s=5, sigma_b=5):
    if label == 1:  # compare by value; "is" tests object identity
        mu = mu_s
        sigma = sigma_s
    else:
        mu = mu_b
        sigma = sigma_b
    return np.random.normal(mu,sigma)

In [6]:
def createToyData(n = 100,dim = 3,s_prob = 0.05):
    # validate dim before allocating the array
    if dim < 3:
        print("Operation canceled.",
              "Data should have at least one",
              "additional dimension besides weights and labels.",
              "(dim >=3)")
        return None
    data = np.zeros(shape = (n,dim),dtype=float)
    data[:,0] = np.random.rand(n) #weights
    for i in range(0,n):
        # label-determination: a weight <= s_prob marks a signal
        label = 1 if data[i,0] <= s_prob else 0
        data[i,1] = label
        for j in range(2,dim):
            # feature means grow with the column index:
            # mu_s = (j-1)*5, mu_b = (j-1)*20
            data[i,j]=generateFeature(label,mu_s=(j-1)*5,mu_b=(j-1)*20)
    return data

In [7]:
n = 100000
prob = 0.05
data = createToyData(n,dim=10,s_prob=prob)

In [8]:
weights = data[:,0]
labels = data[:,1]
calcMaxAMS(weights,labels);

AMS: 21.19484139384361
Found 5020 signals.
Weightsums signal: 124.326027277 | background: 0
Maximum AMS possible with this Data: 21.19484139384361


For a second AMS, we guess the labels at random, using only knowledge of the toy data's signal probability.

In [9]:
sol_weights = np.random.rand(n)
# label-determination: guess "signal" with probability prob
sol = (sol_weights <= prob).astype(float)
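
As a sanity check on the random guess, the expected weight sums can be estimated in closed form. Because the guess is independent of the true label, it selects about a fraction `prob` of the total signal weight and the same fraction of the total background weight. The sketch below is self-contained and regenerates its own toy weights with a seeded generator, so its numbers will differ from the ones in this notebook; the names `guess`, `expected_s` and `expected_b` are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(42)  # assumption: seeded generator, not used in the notebook
n, prob, b_reg = 100_000, 0.05, 10.0

weights = rng.random(n)
labels = weights <= prob       # toy-data label rule
guess = rng.random(n) <= prob  # random guess, independent of the labels

s = weights[guess & labels].sum()   # signal guessed as signal
b = weights[guess & ~labels].sum()  # background guessed as signal
ams = math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

expected_s = prob * n * prob**2 / 2        # 6.25
expected_b = prob * n * (1 - prob**2) / 2  # 2493.75

print("s:", s, "expected:", expected_s)
print("b:", b, "expected:", expected_b)
print("AMS:", ams, "small-signal approximation:", s / math.sqrt(b + b_reg))
```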

In [157]:
s,b = calcWeightSums(weights,sol,labels)
br = 10

In [169]:
s*=10

In [163]:
b*=10

In [170]:
s+b+br

25055.584526883151

In [171]:
1.0 + s/(b+br)

1.002459486418638

In [172]:
math.log(1.0 + s/(b+br))

0.002456466831991091

In [173]:
math.sqrt( 2 *( (s+b+br) * math.log (1.0 + s/(b+br)) -s) )

0.38867392440571713

In [174]:
calcAMS(s,b);

AMS: 0.38867392440571713


AMS scores:
1. (rank #998, rank #999 with k = 100)   
    kNN: 3.1689810059694348 with k = 200 and feature list:
    * "DER_mass_MMC",
    * "DER_mass_transverse_met_lep",
    * "DER_mass_vis",
    * "DER_met_phi_centrality",
    * "DER_pt_ratio_lep_tau",
    * "PRI_tau_pt",
    * "DER_pt_h"
2. (rank #1473)   
    Logistic regression: 2.0563933037592506 with all features