# Leaf classification using unsupervised learning 
## KMeans clustering

The data consists of a labeled training set and an unlabeled test set.
**I do not use the labels when fitting the clusters; I use them only afterwards, to assign species probabilities to each cluster and to score the trained model.**

There are 99 species of leaves, so I fit 99 clusters, one for each species.
 
## Data fields

The dataset provides pre-extracted features.

The features come in three sets: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram. Each set is given as a 64-attribute vector per leaf sample.

| Column      | Description  |
|-------------|---|
| id | an anonymous id unique to an image |
| species | the species of the leaf (training set only) |
| margin1, margin2, margin3, ..., margin64 | the 64 attributes of the margin feature vector |
| shape1, shape2, shape3, ..., shape64 | the 64 attributes of the shape feature vector |
| texture1, texture2, texture3, ..., texture64 | the 64 attributes of the texture feature vector |
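
To make the layout concrete, here is a minimal sketch that splits the feature columns into the three groups by name prefix. It assumes the column names used in `train.csv` (`margin1` ... `texture64`); the `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

groups = ('margin', 'shape', 'texture')

# Toy frame with the same column naming as train.csv (all values are placeholders).
columns = ['%s%d' % (group, i) for group in groups for i in range(1, 65)]
toy = pd.DataFrame(np.zeros((3, len(columns))), columns=columns)
toy.insert(0, 'id', [1, 2, 3])

# Select each group of 64 attributes by its column-name prefix.
feature_groups = {
    group: toy.filter(regex='^%s[0-9]+$' % group)
    for group in groups
}

for group in groups:
    print(group, feature_groups[group].shape)
```

Each group should come out with 64 columns, which gives 3 x 64 = 192 features per leaf in total.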

## Evaluation

Submissions are evaluated using the multi-class logarithmic loss. Each image has been labeled with one true species. For each image, you must submit a set of predicted probabilities (one for every species). 
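
For reference, the standard definition of the multi-class logarithmic loss, which the `score` function in this notebook follows, can be written as below. The symbols are my own notation, not taken from the competition page: $N$ is the number of images, $M$ the number of species, $y_{ij}$ is 1 if image $i$ belongs to species $j$ and 0 otherwise, $p_{ij}$ is the predicted probability that image $i$ belongs to species $j$, and $\log$ is the natural logarithm.

$$
\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})
$$

Because only the term for the true species is non-zero, a confident wrong prediction (a very small $p_{ij}$ for the true species) is penalized heavily.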

#### The dataset is from kaggle.com:   https://www.kaggle.com/c/leaf-classification/data

In [3]:
import pandas as pd

train_data=pd.read_csv('train.csv')
test_data=pd.read_csv('test.csv')

train_data[:3]

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293


### Algorithm for creating the classification probabilities (weights) for clusters

 1. I put the labeled and unlabeled data together.
 2. I fit clusters on the combined data.
 3. Then, for each cluster, I use the labeled entries to count how many entries of each species it contains.
    - If only one species is present, all the unlabeled data in the cluster gets that species' label.
    - If several species are present, the probability is split in proportion to the counts of each labeled species. *Ex: 2 entries with species A, and 1 entry with species B -> prob(A)=66.7%, prob(B)=33.3%*
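
The counting step can be sketched on a toy example. This is a simplified illustration and not the notebook's own function: the `labelled` frame is hypothetical, `pd.crosstab` stands in for the pivot-and-groupby used in the real code, and the small `error` floor is left out.

```python
import pandas as pd

# Hypothetical labelled entries and the clusters they were assigned to.
labelled = pd.DataFrame({
    'species':    ['A', 'A', 'B', 'C'],
    'cluster_id': [0,   0,   0,   1],
})

# Count each species per cluster, then normalise every row to sum to 1.
counts = pd.crosstab(labelled['cluster_id'], labelled['species'])
weights = counts.div(counts.sum(axis=1), axis=0)

print(weights.round(3))
```

Cluster 0 holds two entries of species A and one of species B, so its row is split 2/3 to 1/3; cluster 1 holds only species C, so C gets the full probability.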

In [4]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing

def getAllFeatures(data):
    # keep only the feature columns (drop the id and the species label)
    return data.drop(['id', 'species'], axis=1)

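def fit_model(X):
    # standardize every feature to zero mean and unit variance
    X=preprocessing.scale(X)

    # one cluster per species; n_jobs is omitted because recent scikit-learn
    # releases no longer accept it for KMeans, and n_init=10 pins the
    # long-standing default number of restarts
    model=KMeans(n_clusters=99, n_init=10).fit(X)

    return model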

def create_species_weights_for_clusters(species, labels, error):
    y=pd.DataFrame()
    y['species']=species
    y['cluster_id']=labels
    y['amount']=np.ones(len(species))

    # one-hot encode the species, then count each species per cluster
    catalogue=y.pivot(columns='species', values='amount').fillna(0)
    catalogue['cluster_id']=labels
    catalogue=catalogue.groupby('cluster_id').sum()

    # to soften the log-loss penalty when we misclassify,
    # we replace every zero count with a small value
    catalogue=catalogue.replace(0,error)

    # we scale the probabilities to sum to 1 per cluster_id
    c_sum=catalogue.sum(axis=1)
    weights=catalogue.div(c_sum, axis=0)
    return weights
    
def getPredictions(model, weights, test_data):
    predictions=pd.DataFrame()
    predictions['id']=test_data['id']
    predictions['cluster_id']=test_data['labels']

    # inner merge: test rows whose cluster has no labelled entries are dropped
    return pd.merge(predictions, weights, left_on='cluster_id', right_index=True
                   ).drop('cluster_id', axis=1)
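
To see why the `error` floor matters, here is a small worked calculation. It assumes a hypothetical cluster whose labelled entries are three leaves of a single species A, and that all 99 species appear as columns of the weights table; the numbers are for illustration only.

```python
import math

n_species = 99   # number of species in the data set
error = 0.001    # value used in place of zero counts
count_a = 3      # hypothetical: three labelled leaves of species A in the cluster

# Row total after every zero count is replaced by `error`.
row_sum = count_a + (n_species - 1) * error
p_a = count_a / row_sum
p_other = error / row_sum

print('p(A)     = %.4f' % p_a)
print('p(other) = %.6f' % p_other)
print('loss if the true species is A:          %.3f' % -math.log(p_a))
print('loss if the true species is not A:      %.3f' % -math.log(p_other))
print('loss for a hard zero clipped to 1e-15:  %.3f' % -math.log(1e-15))
```

Under these assumptions a wrong cluster assignment costs roughly 8 per image instead of roughly 34.5, at the price of a slightly higher loss on correct assignments.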

In [5]:
# implementation of the multi-class logarithmic loss
def score(predictions,y_test):
    # one-hot encode the true species
    y=pd.DataFrame()
    y['species']=y_test
    y['amount']=np.ones(len(y_test))
    catalogue=y.pivot(columns='species', values='amount').fillna(0)

    # replace zeros with 1e-15 so that the logarithm stays finite
    pred=np.log(predictions.drop('id', axis=1).replace(0, 1e-15))
    return pred.mul(catalogue).sum().sum() / len(pred) * -1
        
def predict_using_both_train_test_data(train_data, test_data, error):
    # mark the unlabeled rows without modifying the caller's DataFrame
    test_data=test_data.assign(species='-1')
    all_data=pd.concat([train_data, test_data])

    model=fit_model( getAllFeatures(all_data))
    all_data['labels']=model.labels_

    train_index=all_data['species'] != '-1'
    weights=create_species_weights_for_clusters(all_data[train_index]['species'],
                                                all_data[train_index]['labels'],
                                                error)

    test_index=all_data['species'] == '-1'
    predictions=getPredictions(model, weights, all_data[test_index].drop('species', axis=1))

    if(len(predictions) != len(test_data)):
        print("[WARNING] There are clusters without labelled data")

    return predictions
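
The warning above fires when test rows are dropped because their cluster contains no labelled entries. One possible way to keep those rows, sketched here as a suggestion and not as part of the notebook's method, is a left merge followed by a uniform fallback. The `weights` and `test_rows` frames are hypothetical toy data.

```python
import pandas as pd

# Hypothetical cluster weights: clusters 0 and 1 contain labelled leaves, cluster 2 does not.
weights = pd.DataFrame({'A': [0.9, 0.2], 'B': [0.1, 0.8]}, index=[0, 1])

# Hypothetical test rows with the clusters they were assigned to.
test_rows = pd.DataFrame({'id': [10, 11, 12], 'cluster_id': [0, 2, 1]})

# A left merge keeps every test row; rows whose cluster is missing from
# `weights` come back as NaN and are filled with a uniform distribution.
merged = pd.merge(test_rows, weights, how='left',
                  left_on='cluster_id', right_index=True)
species_cols = list(weights.columns)
merged[species_cols] = merged[species_cols].fillna(1.0 / len(species_cols))

predictions = merged.drop('cluster_id', axis=1)
print(predictions)
```

With this fallback every test image receives a row of probabilities, so the submission file has the expected length even when some clusters contain no labelled data.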

In [8]:
from sklearn.model_selection import KFold

def CV(error):
    # 10-fold cross-validation on the labeled data: each held-out fold
    # plays the role of the unlabeled test set
    nr=10
    sc=0
    kf = KFold(n_splits=nr)

    for train_index, test_index in kf.split(train_data):

        predictions=predict_using_both_train_test_data(
                                train_data.iloc[train_index],
                                train_data.iloc[test_index].drop('species', axis=1),
                                error)
        new_score=score(predictions, train_data.iloc[test_index]['species'])
        sc=sc+new_score
    return sc/nr

print(CV(0.001))

0.480443989959


In [9]:
# train on all the data
predictions=predict_using_both_train_test_data( train_data, test_data, 0.001)

predictions.to_csv(path_or_buf='classifications.csv', header=True, index=False)
print('Done.')

Done.
