Importing standard libraries (numpy, matplotlib).

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Loading data

In [2]:
#pandas.read_csv seems to be the most robust way of loading CSV data
from pandas import read_csv

data=read_csv('multi-label.csv') #load CSV
num_samp=data.shape[0] #first dim -> number of samples in DB
num_feat=data.shape[1] #second dim -> number of features per sample
num_feat-=2 #remove first and last column from count

In [3]:
print('Number of samples: {}'.format(num_samp))
print('Number of features: {}'.format(num_feat))

Number of samples: 75
Number of features: 97


### Converting data for classification

We need two matrices:

  * input matrix - dimensions [num_samp x num_feat]
  * output matrix - dimensions [num_samp x num_cls]

In [4]:
inp=data.iloc[:,1:-1].values #extract all values, except first (file name) and last (class) column
out_str=data.iloc[:,-1].values.tolist() #extract class column only

#convert list of class strings into list-of-lists
out_lstr=[]
for s in out_str:
    out_lstr.append(s.split('+')) #split each string on '+' sign

cls=sorted(set([i for j in out_lstr for i in j])) #sort and uniq all elements of the above list-of-lists
num_cls=len(cls) #calculate number of classes
cls_ind=dict(zip(cls,range(num_cls))) #create a map between class name -> class index

#create a binary matrix of multi-label classes (one row per sample, one column per class)
out=np.zeros((num_samp,num_cls))
for si in range(num_samp):
    for c in out_lstr[si]:
        out[si,cls_ind[c]]=1
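
The conversion above can also be sketched with scikit-learn's `MultiLabelBinarizer`, which builds the same kind of binary indicator matrix from a list-of-lists. This is only an illustration of the idea: the class strings below are made up and merely mimic the `a+b` format of the class column.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical class strings in the same 'a+b' format as the class column
example_str = ['cat+dog', 'dog', 'bird+cat']

# Split each string on '+' to get a list-of-lists of class names
example_lstr = [s.split('+') for s in example_str]

# fit_transform collects the sorted unique class names and builds the
# binary matrix with one row per sample and one column per class
mlb = MultiLabelBinarizer()
example_out = mlb.fit_transform(example_lstr)

print(list(mlb.classes_))
print(example_out)
```

`mlb.classes_` plays the role of `cls` here, and `mlb.inverse_transform` can map predicted rows back to class names.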

## Toy sample set

In [5]:
# from sklearn.datasets import make_multilabel_classification

# inp, out = make_multilabel_classification(n_samples=77, n_features=20, n_classes=7, n_labels=2, allow_unlabeled=False, random_state=1)

## Cross validation

Since we have little data, we will use cross validation. The experiment is also repeated several times to estimate how much the score varies between runs.

In [6]:
from sklearn.model_selection import cross_val_score

#run 'num' repetitions of cross validation with 'cv' folds and print mean and 2*std of all fold scores
def run(classifier,cv=7,num=10):
    scores=[]
    for i in range(num):
        scores.append(cross_val_score(classifier,inp,out,cv=cv)) #score the classifier passed as argument
    scores=np.array(scores).flatten()
    print('ACC: {:%} (+/- {:%})'.format(scores.mean(),scores.std()*2))
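
One caveat when repeating `cross_val_score` with an integer `cv`: for multi-label targets scikit-learn falls back to an unshuffled `KFold`, so every repetition sees the same folds and only the classifier's own randomness can change the score. A minimal sketch of one possible variation, which reshuffles the folds on each repetition, is given here. The name `run_shuffled` and the synthetic data standing in for `inp` and `out` are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for inp/out (assumption: not the real data set)
X, Y = make_multilabel_classification(n_samples=77, n_features=20, n_classes=7,
                                      n_labels=2, allow_unlabeled=False,
                                      random_state=1)

def run_shuffled(classifier, X, Y, n_splits=7, num=10):
    """Repeat K-fold cross validation 'num' times with different shuffles."""
    scores = []
    for i in range(num):
        # A different random_state per repetition gives different folds
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=i)
        scores.append(cross_val_score(classifier, X, Y, cv=cv))
    return np.array(scores).flatten()

scores = run_shuffled(KNeighborsClassifier(n_neighbors=3), X, Y)
print('ACC: {:%} (+/- {:%})'.format(scores.mean(), scores.std() * 2))
```

With shuffling, the reported spread reflects variation across different splits of the data rather than across identical folds.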

## Experiments

### K-NN

In [7]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
run(clf)

ACC: 4.155844% (+/- 14.377509%)
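
When reading these numbers it helps to know what is being scored. For multi-label targets, the default `score` of scikit-learn classifiers is subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly. The sketch below contrasts this with a per-label accuracy derived from the Hamming loss. The small arrays are invented for illustration and are unrelated to the data set used here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical ground truth and predictions: 4 samples, 3 labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # all labels correct
                   [0, 1, 1],   # one label wrong
                   [1, 0, 0],   # one label wrong
                   [0, 0, 1]])  # all labels correct

# Subset accuracy: fraction of samples with every label correct
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: fraction of individual label decisions that are correct
per_label_acc = 1.0 - hamming_loss(y_true, y_pred)

print('subset accuracy:    {:%}'.format(subset_acc))
print('per-label accuracy: {:%}'.format(per_label_acc))
```

In this toy case only half of the samples match exactly, although 10 of the 12 individual label decisions are correct, so a low subset accuracy does not by itself mean that most labels are wrong.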


### Decision tree

In [8]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
run(clf)

ACC: 7.766234% (+/- 20.569658%)


StandardScaler is used to normalize the input data: each feature is independently rescaled to have zero mean and unit variance across all (training) samples.

In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(StandardScaler(), DecisionTreeClassifier())
run(clf)

ACC: 6.714286% (+/- 17.957822%)
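
The effect of `StandardScaler` can be checked on a small array. This is a sketch on made-up numbers, chosen only so that the two features live on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 samples, 2 features on different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# fit() learns the per-feature mean and standard deviation,
# transform() subtracts the mean and divides by the standard deviation
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(X_scaled.mean(axis=0))  # per-feature mean after scaling
print(X_scaled.std(axis=0))   # per-feature standard deviation after scaling
```

Inside a pipeline passed to `cross_val_score`, the scaler is fitted on the training folds only and then applied to the held-out fold. Decision trees split on per-feature thresholds, so they are largely insensitive to this kind of rescaling, which is consistent with the similar scores with and without the scaler.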


In [10]:
from sklearn.decomposition import PCA

clf = make_pipeline(PCA(), DecisionTreeClassifier())
run(clf)

ACC: 6.870130% (+/- 17.378720%)


### Random forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
run(clf)

ACC: 4.090909% (+/- 14.463242%)


### MLP

In [12]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
run(clf)

ACC: 2.636364% (+/- 13.039401%)


### SVM

SVC doesn't support multi-label targets directly, but we can use the one-vs-rest strategy to get around this limitation: one binary SVM is trained per class.

In [13]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC())
run(clf)

ACC: 0.000000% (+/- 0.000000%)
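
The one-vs-rest wrapper can be inspected to see that it really trains one binary classifier per class column. The sketch below uses synthetic data as a stand-in for `inp` and `out`, which is an assumption for illustration, so its numbers say nothing about the result above.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for inp/out (assumption: not the real data set)
X, Y = make_multilabel_classification(n_samples=77, n_features=20, n_classes=7,
                                      n_labels=2, allow_unlabeled=False,
                                      random_state=1)

# Each column of Y is treated as a separate binary problem
ovr = OneVsRestClassifier(SVC()).fit(X, Y)

print(len(ovr.estimators_))  # number of fitted binary classifiers
pred = ovr.predict(X)
print(pred.shape)            # one row per sample, one column per class
```

Because each class is predicted independently, a sample scores as correct under subset accuracy only when all of the per-class SVMs are right at the same time.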
