This is an exercise to determine the importance of sensor features given a labelled dataset.

### Result

The final ranking I got for the importance of the sensors was: 6, 8, 4, 0, 2, 3, 9, 5, 7, 1. The label can be predicted from sensor 6 alone with very high accuracy (above 95%). Some sensors were separated by only a few percentage points: sensors 0 and 2 in particular had roughly the same accuracy of about 80%, so their relative order can swap from run to run because the data is shuffled randomly. Sensors 1, 7 and 5 scored in the low to mid 50s, which is little better than random guessing.

### Method

I trained a Support Vector Machine (SVM) for each individual feature and ranked the features according to the quality of the models.

As is standard with this approach, I split the data into a training set to fit the model and a test set to evaluate it. Since the dataset is small, it makes sense to have a fairly large test set. I also split the positive and negative samples separately, so that each set keeps the same proportion of positive and negative samples as the overall dataset. If a set is too small, or has many more positive or negative samples than the overall dataset (which is more likely with smaller sets), it may not be representative. If the test set isn't representative, the score on it isn't a useful measure of model quality.
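A minimal sketch of such a class-balanced split, using only the standard library. The function name `stratified_split`, the `seed` parameter and the toy labels are illustrative and not part of the notebook; the idea is to shuffle each class separately and cut both at the same fraction.

```python
import random

def stratified_split(labels, train_frac, seed=0):
    """Return (train_idx, test_idx) with every class cut at the same fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Indices of all samples belonging to this class, in random order
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx

labels = [1] * 8 + [0] * 12  # toy data: 40% positive
train_idx, test_idx = stratified_split(labels, train_frac=0.75)

# Both sets keep the 40% positive rate of the full dataset
train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_pos_rate, test_pos_rate)
```

With a plain random split of 20 samples, the test set could easily end up with one or four positives instead of two, which is the kind of unrepresentative set described above.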

I used an approach similar to k-fold cross-validation to reduce the variability of the ranking. I trained and evaluated multiple models, each using a different subset as the test set, and averaged the test scores to get the final score for each feature. I also looked at the standard deviation of the scores to see how much they varied between folds and to check for outliers. I used k=4 (4-fold cross-validation), so that each test set holds a quarter of the data, which is fairly large.
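The fold bookkeeping can be sketched as follows, assuming the samples have already been shuffled. The helper `fold_indices` and the per-fold accuracies are made up for illustration; the population standard deviation (`pstdev`) is used because it matches the default behaviour of `numpy.std`.

```python
from statistics import mean, pstdev

def fold_indices(n_samples, k):
    """Yield (test_idx, train_idx) for k contiguous, equally sized folds."""
    size = n_samples // k  # any remainder is never used for testing
    for fold in range(k):
        start, end = fold * size, (fold + 1) * size
        test_idx = list(range(start, end))
        train_idx = list(range(0, start)) + list(range(end, n_samples))
        yield test_idx, train_idx

folds = list(fold_indices(n_samples=20, k=4))

# Hypothetical accuracies of one feature's model on the four test folds
fold_scores = [0.80, 0.75, 0.85, 0.80]
summary = (mean(fold_scores), pstdev(fold_scores))
print(summary)
```

Each sample lands in exactly one test fold, so every model is scored on data it never saw during training, and the mean and standard deviation summarise how stable the score is across folds.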

### Advantages and disadvantages of this approach

The overall approach is reasonably fast, although not as fast as some others. The number of models required is linear in the number of features, and each model is simple, since it depends on only one feature. How long each model takes to train depends on the number of samples. Overall, it strikes a good balance between speed and solution quality.

SVMs work fairly well on small datasets, don’t take particularly long to train and capture somewhat complex relationships between the feature and the label.

However, this approach does not evaluate how predictive combinations of features might be. Some set of features considered together might predict the label with high accuracy even though each of them is not very predictive individually.
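A toy illustration of this weakness, using made-up data rather than the sensor dataset: when the label is the XOR of two binary features, no rule based on a single feature beats 50% accuracy, yet the two features together determine the label exactly.

```python
from itertools import product

# Toy data: two binary features (a, b) and a label equal to a XOR b
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25

def best_single_feature_accuracy(samples, feature):
    """Best accuracy of any rule that maps one feature's value to a label."""
    best = 0.0
    # rule[v] is the label predicted when the feature has value v
    for rule in product([0, 1], repeat=2):
        correct = sum(rule[s[feature]] == s[2] for s in samples)
        best = max(best, correct / len(samples))
    return best

acc_a = best_single_feature_accuracy(samples, 0)
acc_b = best_single_feature_accuracy(samples, 1)
# A rule that is allowed to look at both features recovers the label exactly
acc_both = sum((a ^ b) == y for a, b, y in samples) / len(samples)
print(acc_a, acc_b, acc_both)  # 0.5 0.5 1.0
```

A per-feature ranking would place both of these features at the bottom, even though the pair is perfectly predictive.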

### Other possible approaches

The correlation coefficient of the label against each feature can be used to measure importance. This is very fast, although it only measures a linear relationship: whether one variable tends to increase or decrease as the other increases.
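To make that limitation concrete, here is a small sketch on synthetic data (not the sensor data). The `pearson` helper is a hand-written version of the usual formula. A label that switches on once a feature crosses a threshold correlates strongly with the feature, while a label that is on at both extremes of the feature has a correlation of about zero, even though the feature determines it completely.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [i / 10 for i in range(-20, 21)]  # evenly spaced values from -2.0 to 2.0
threshold_labels = [1 if x > 0 else 0 for x in xs]  # on above a cutoff
band_labels = [1 if abs(x) > 1 else 0 for x in xs]  # on at both extremes

r_threshold = pearson(xs, threshold_labels)
r_band = pearson(xs, band_labels)
print(round(r_threshold, 2), round(abs(r_band), 2))
```

This is one way a feature can rank low by correlation and still be highly predictive for a nonlinear model; whether that is what happens for any particular sensor here would need to be checked against the data.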

Principal Component Analysis can be useful to eliminate redundant features, although it transforms the features into components, which are difficult to translate back into the importance of a single feature.

For each feature, one can train a model without that feature, eliminate the feature whose removal gave the lowest (or highest) score on the test data, and repeat until there is only one feature left. This would address the weakness of the selected approach by capturing the importance of features in combination. It is a fairly slow approach, however, requiring a quadratic number of models to be trained. Depending on the type of model and hyperparameters used, training each individual model may also take some time and potentially require some parameter tweaking to get right. I briefly tried this approach, but it was heavily affected by noise, making the end result somewhat random and not suitable to draw conclusions from.

One can also train a model on every subset of features and judge the importance of a feature based on the performance of the models with and without that feature. Compared to the above, this would avoid eliminating, early on, a single feature that is highly correlated with the output but less predictive than some subset of the other features. However, this would be extremely slow: an exponential number of models would need to be trained.
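The difference in cost between the approaches can be made concrete by counting the models each one trains. The function names below are illustrative. The elimination count assumes one model per candidate removal in each round and a single train/test split, and the subset count covers every non-empty subset.

```python
def models_individual(n_features, folds=1):
    """One model per feature per cross-validation fold: linear in n_features."""
    return n_features * folds

def models_backward_elimination(n_features):
    """Rounds with n, n-1, ..., 2 remaining features: quadratic in n_features."""
    return sum(range(2, n_features + 1))

def models_all_subsets(n_features):
    """One model per non-empty subset of features: exponential in n_features."""
    return 2 ** n_features - 1

n = 10  # number of sensors in this exercise
counts = (
    models_individual(n, folds=4),
    models_backward_elimination(n),
    models_all_subsets(n),
)
print(counts)  # (40, 54, 1023)
```

For ten features the first two are comparable, but doubling the number of features to 20 gives 80, 209 and 1,048,575 models respectively.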

Other models could also have been used instead of support vector machines. Given the size of the dataset, approaches that typically require large datasets wouldn’t be promising. Neural networks are one example, although reducing the number of neurons, hidden layers and iterations would allow them to work on smaller datasets; this may require a lot more tweaking to get right than SVMs do. Other approaches do work with small datasets, each with its own limitation: logistic regression only captures the relationship where there is some cutoff after which the label is true, decision trees are more suited to datasets containing categorical or discrete data, and naive Bayes assumes independence between features and isn’t particularly well-suited to continuous data.

In [1]:
from collections import defaultdict, namedtuple

from numpy import mean, std
from pandas import concat, read_csv
from scipy.stats import pearsonr
from sklearn.svm import SVC

def split_train_and_test(data, train_frac):
    """ Split data into training and test sets, with the training set holding
        `train_frac` of the samples of each class
    """
    pos = data.loc[data['class_label'] == 1.0]
    neg = data.loc[data['class_label'] == 0.0]
    # Shuffle each class separately so both sets keep the overall class balance
    shuffled_pos = pos.sample(frac=1)
    shuffled_neg = neg.sample(frac=1)
    pos_offset = int(len(pos)*train_frac)
    neg_offset = int(len(neg)*train_frac)
    # DataFrame.append was removed in pandas 2.0; concat replaces it
    train = concat([shuffled_pos[:pos_offset], shuffled_neg[:neg_offset]])
    test = concat([shuffled_pos[pos_offset:], shuffled_neg[neg_offset:]])
    return train, test

def extract_section(dataframe, index, count):
    """ Extracts section `index` of `count` sections from the dataframe. Returns
        (section, rest)
    """
    # Integer division: any leftover rows at the end never form a section
    offset = len(dataframe) // count
    start = index * offset
    end = (index + 1) * offset
    return dataframe[start:end], concat([dataframe[:start], dataframe[end:]])

IndexValue = namedtuple('IndexValue', 'index value')

def correlation_importance(data):
    """ Returns the features sorted by the absolute value of the Pearson
        correlation coefficient, from most to least important
    """
    correlations = []
    for i in range(10):
        corr, _ = pearsonr(data['class_label'], data[f'sensor{i}'])
        correlations.append(IndexValue(i, corr))
    return [x.index for x in sorted(correlations, key=lambda x: abs(x.value),
                                    reverse=True)]

def combined_importance(data, train_frac, invert):
    """ For each feature, train a model without that feature. Eliminate the
        feature which gave the lowest score on test data (or highest, if invert
        is set). Repeat until there is only one feature left. Returns the list
        of features removed, from most to least important
    """
    train, test = split_train_and_test(data, train_frac=train_frac)
    columns = [IndexValue(i, f'sensor{i}') for i in range(10)]
    removed = []
    while len(columns) > 1:
        # Start from a sentinel outside the valid accuracy range [0, 1]
        best, bcol = -1 if invert else 2, None
        for col in columns:
            # Train and score on every remaining feature except `col`
            kept = [c.value for c in columns if c != col]
            model = SVC(gamma='scale')
            model.fit(train[kept], train['class_label'])
            score = model.score(test[kept], test['class_label'])
            if (invert and score > best) or (not invert and score < best):
                best = score
                bcol = col
        removed.append(bcol.index)
        columns = [c for c in columns if c != bcol]
    removed.append(columns[0].index)
    if invert:
        # Least important features were removed first, so flip the order
        removed.reverse()
    return removed

def individual_importance(data, cv_sections):
    """ Trains a model for each individual feature. The data is split into
        `cv_sections`, and the training is run multiple times with each section
        as a test set, with the rest used for the train set. The average score
        on the test sets is then used as the importance of a feature. Returns a
        list of (feature, [scores]), from most to least important
    """
    shuffled_pos = data.loc[data['class_label'] == 1.0].sample(frac=1)
    shuffled_neg = data.loc[data['class_label'] == 0.0].sample(frac=1)
    combined_scores = defaultdict(list)
    for test_i in range(cv_sections):
        # Take the same section from each class to keep the folds balanced
        test_pos, train_pos = extract_section(shuffled_pos, test_i, cv_sections)
        test_neg, train_neg = extract_section(shuffled_neg, test_i, cv_sections)
        train = concat([train_pos, train_neg])
        test = concat([test_pos, test_neg])
        scores = []
        for i in range(10):
            model = SVC(gamma='auto')
            model.fit(train[[f'sensor{i}']], train['class_label'])
            score = model.score(test[[f'sensor{i}']], test['class_label'])
            scores.append(IndexValue(i, score))
        scores.sort(key=lambda iv: iv.value, reverse=True)
        log(f"Individual importance {test_i}:", [iv.index for iv in scores],
            prefix="    ")
        for score_iv in scores:
            combined_scores[score_iv.index].append(score_iv.value)
    # Every feature has the same number of scores, so sorting by the sum is
    # equivalent to sorting by the mean
    return sorted(combined_scores.items(), key=lambda kv: sum(kv[1]),
                  reverse=True)

def log(key, value, key_width=30, prefix=""):
    """ Print the given key and value, with spaces to the right of key up to
        `key_width`
    """
    if prefix:
        print(prefix, key.ljust(key_width), value)
    else:
        print(key.ljust(key_width), value)

data = read_csv('data/sensor-data.csv')
data.loc[data['class_label'] == -1, 'class_label'] = 0

combined_scores = individual_importance(data, cv_sections=4)
log("Individual percentages (mean):",
    "[%s]" % ", ".join("%2.2f" % mean(x[1]) for x in combined_scores))
log("Individual percentages (std):",
    "[%s]" % ", ".join("%2.2f" % std(x[1]) for x in combined_scores))
print()
log("Final ranking:", [x[0] for x in combined_scores])
with open('ranking.txt', 'w') as file:
    file.write("\n".join(f"sensor{score[0]}" for score in combined_scores))

print()
print("Other measures:")
log("Combined importance:", combined_importance(data, 0.75, invert=False))
log("Combined invert importance:",
    combined_importance(data, 0.75, invert=True))
log("Correlation:", correlation_importance(data))


     Individual importance 0:       [6, 8, 4, 2, 0, 3, 9, 5, 1, 7]
     Individual importance 1:       [6, 2, 8, 0, 4, 3, 5, 9, 7, 1]
     Individual importance 2:       [6, 8, 4, 0, 2, 3, 9, 5, 7, 1]
     Individual importance 3:       [6, 8, 4, 2, 0, 9, 3, 7, 1, 5]
Individual percentages (mean): [0.97, 0.88, 0.84, 0.80, 0.79, 0.72, 0.66, 0.56, 0.54, 0.48]
Individual percentages (std):  [0.02, 0.02, 0.01, 0.06, 0.03, 0.01, 0.05, 0.05, 0.04, 0.03]

Final ranking:                 [6, 8, 4, 2, 0, 3, 9, 5, 7, 1]

Other measures:
Combined importance:           [0, 6, 8, 4, 2, 1, 5, 3, 9, 7]
Combined invert importance:    [6, 7, 9, 4, 1, 5, 2, 3, 8, 0]
Correlation:                   [8, 4, 0, 3, 1, 5, 7, 9, 2, 6]
