# ATLAS Classification Experiments

In this notebook, I want to answer the questions below. The main tool for investigation will be a set of uncertainty measures generated by $k$-fold committee classification, which assigns an uncertainty to each ATLAS object.

- What is the distribution of these uncertainties?
- Are the very certain objects interesting?
- Are the very uncertain objects interesting?
- If we compute a similar uncertainty for each IR object, is there a relationship between the uncertainty of the IR objects and the radio subjects they appear near?

## Setup: A classifier for the Radio Galaxy Zoo

In this section, I'll build a classifier class to make the problem a little easier to grasp. An `RGZClassifier` will have access to the features of all IR objects in the crowdastro database. It will have a `train` method, which takes a set of IR indices and the corresponding IR training labels and trains the classifier, and a `predict` method, which takes an ATLAS subject vector and returns the index of the IR object most likely to be its host galaxy.

In [47]:
import h5py
import numpy
import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing

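ARCMIN = 1 / 60  # One arcminute in degrees; the search radius for nearby IR objects.
CROWDASTRO_H5_PATH = '../crowdastro.h5'
IMAGE_SIZE = 200 * 200  # Length of the flattened image stored in each ATLAS vector.
TRAINING_H5_PATH = '../training.h5'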

In [48]:
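class RGZClassifier(object):
    """Picks the most likely host galaxy of an ATLAS subject from nearby IR objects.

    ir_features: Feature matrix with one row per IR object. The first n_astro
        columns are "astro" features and the remaining columns are image features.
    n_astro: Number of "astro" feature columns.
    """
    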
    def __init__(self, ir_features, n_astro):
        self.ir_features = ir_features
        self.n_astro = n_astro
        self._classifier = sklearn.linear_model.LogisticRegression(class_weight='balanced')
        self._astro_transformer = sklearn.pipeline.Pipeline([
            ('normalise', sklearn.preprocessing.Normalizer()),
            ('scale', sklearn.preprocessing.StandardScaler()),
        ])
        self._image_transformer = sklearn.pipeline.Pipeline([
            ('normalise', sklearn.preprocessing.Normalizer()),
        ])
    
    def _fit_transformers(self, ir_indices):
        self._astro_transformer.fit(self.ir_features[ir_indices, :self.n_astro])
        self._image_transformer.fit(self.ir_features[ir_indices, self.n_astro:])
    
    def _transform(self, features):
        return numpy.hstack([
            self._astro_transformer.transform(features[:, :self.n_astro]),
            self._image_transformer.transform(features[:, self.n_astro:]),
        ])
    
    def train(self, ir_indices, ir_labels):
        self._fit_transformers(ir_indices)
        ir_features = self._transform(self.ir_features[ir_indices])
        self._classifier.fit(ir_features, ir_labels)
    
    def predict(self, atlas_vector):
        # Split the ATLAS vector into its components.
        position = atlas_vector[:2]
        image = atlas_vector[2:2 + IMAGE_SIZE]
        distances = atlas_vector[2 + IMAGE_SIZE:]
        # Get nearby IR objects and their features.
        nearby_indices = (distances < ARCMIN).nonzero()[0]
        ir_features = self._transform(self.ir_features[nearby_indices])
        # Find how likely each object is to be the host galaxy.
        probabilities = self._classifier.predict_proba(ir_features)[:, 1]
        # Return the index of the most likely host galaxy.
        return nearby_indices[probabilities.argmax()]
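
To make the layout that `predict` assumes more concrete, here is a minimal sketch of the candidate-selection step on toy data. It assumes the ATLAS vector is laid out as two position values, then the flattened image, then one distance per IR object, as in `predict` above. The names `TOY_IMAGE_SIZE` and `nearby_ir_indices`, and all of the numbers, are made up for illustration, and the toy probabilities stand in for the output of `predict_proba`.

```python
import numpy

ARCMIN = 1 / 60
TOY_IMAGE_SIZE = 4  # Hypothetical; the notebook itself uses 200 * 200.

def nearby_ir_indices(atlas_vector, image_size, radius=ARCMIN):
    """Hypothetical helper: indices of IR objects closer than radius."""
    # Everything after the position and image is a distance to an IR object.
    distances = atlas_vector[2 + image_size:]
    return (distances < radius).nonzero()[0]

# Toy ATLAS vector: position, flattened image, distances to 4 IR objects.
position = [52.0, -28.0]
image = [0.0] * TOY_IMAGE_SIZE
distances = [0.5, 0.01, 0.2, 0.005]
toy_vector = numpy.array(position + image + distances)

# Only IR objects 1 and 3 lie within one arcminute.
nearby = nearby_ir_indices(toy_vector, TOY_IMAGE_SIZE)

# Stand-in for the classifier's host probabilities of the nearby objects.
toy_probabilities = numpy.array([0.2, 0.9])

# The predicted host is the nearby object with the highest probability.
host = nearby[toy_probabilities.argmax()]
print(nearby, host)
```

Note that if no IR object lies within the radius, the candidate array is empty and `argmax` raises a `ValueError`, so subjects without nearby IR objects would need separate handling.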

In [49]:
# Some helper functions for testing.
def get_groundtruth_labels(atlas_vector, ir_labels):
    distances = atlas_vector[2 + IMAGE_SIZE:]
    nearby_indices = (distances < ARCMIN).nonzero()[0]
    nearby_labels = ir_labels[nearby_indices]
    groundtruth = nearby_labels.nonzero()[0]
    return nearby_indices[groundtruth]

In [50]:
# Let's test it out.
with h5py.File(TRAINING_H5_PATH, 'r') as training_h5, \
     h5py.File(CROWDASTRO_H5_PATH, 'r') as crowdastro_h5:
    # Read the full datasets into memory.
    ir_features = training_h5['features'][()]
    ir_labels = training_h5['labels'][()]
    ir_train_indices = training_h5['is_ir_train'][()].nonzero()[0]

    classifier = RGZClassifier(ir_features, 5)
    classifier.train(ir_train_indices, ir_labels[ir_train_indices])

    atlas_vectors = crowdastro_h5['/atlas/cdfs/numeric']
    atlas_test_indices = training_h5['is_atlas_test'][()].nonzero()[0]
    
    n_correct = 0
    n_total = 0
    for atlas_index in atlas_test_indices:
        atlas_vector = atlas_vectors[atlas_index]
        groundtruths = get_groundtruth_labels(atlas_vector, ir_labels)
        prediction = classifier.predict(atlas_vector)
        n_correct += prediction in groundtruths
        n_total += 1
    
    print('{:.02%}'.format(n_correct/n_total))

75.00%


## Uncertainty by committee

In this section, I'll make a committee of 10 classifiers, each of which will learn on a different subset of the training data. I'll use 5-fold cross-validation to generate 10 classifications for each ATLAS object, and the percentage of committee members that agree with the majority classification will form an uncertainty estimate for that object.
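
As a rough sketch of the agreement measure described above (not necessarily how the rest of this notebook computes it), suppose the committee's predictions are collected into an array with one row per classifier and one column per ATLAS object, where each entry is a predicted IR index. The function name `committee_agreement` and the toy predictions are hypothetical.

```python
import numpy

def committee_agreement(predictions):
    """Majority vote and agreement fraction for each object.

    predictions: (n_classifiers, n_objects) array of predicted IR indices.
    Returns (majority, agreement), each of length n_objects.
    """
    predictions = numpy.asarray(predictions)
    n_classifiers, n_objects = predictions.shape
    majority = numpy.empty(n_objects, dtype=predictions.dtype)
    agreement = numpy.empty(n_objects, dtype=float)
    for j in range(n_objects):
        # Count how many classifiers voted for each distinct IR index.
        values, counts = numpy.unique(predictions[:, j], return_counts=True)
        best = counts.argmax()
        majority[j] = values[best]
        # Fraction of the committee that agrees with the majority vote.
        agreement[j] = counts[best] / n_classifiers
    return majority, agreement

# Hypothetical predictions: 4 classifiers (rows) for 3 ATLAS objects (columns).
toy_predictions = numpy.array([
    [7, 2, 5],
    [7, 3, 5],
    [7, 2, 9],
    [7, 2, 5],
])
majority, agreement = committee_agreement(toy_predictions)
print(majority)   # Majority IR index per object.
print(agreement)  # 1.0 means the committee was unanimous.
```

An agreement of 1.0 marks an object the committee is certain about, while lower values mark objects on which the classifiers disagree. Ties between IR indices are broken in favour of the smallest index here, which is an arbitrary choice.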