# Leaf classification using logistic regression

The dataset covers 99 species of leaves, and the goal is to predict the species of each leaf sample.
 
## Data fields

The features are already extracted and come in three sets: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram. Each set is a 64-attribute vector per leaf sample, which gives 192 features in total.

| Column | Description |
|---|---|
| id | an anonymous id unique to an image |
| species | the species label of the leaf (training set only) |
| margin1, margin2, margin3, ..., margin64 | the 64 attributes of the margin feature vector |
| shape1, shape2, shape3, ..., shape64 | the 64 attributes of the shape feature vector |
| texture1, texture2, texture3, ..., texture64 | the 64 attributes of the texture feature vector |
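
The three feature groups can be pulled apart by column-name prefix. The sketch below uses a toy frame with random values as a stand-in for `train.csv` (the frame `toy` and the dictionary `feature_blocks` are illustrative names, not part of the notebook), and the pattern accepts names with or without an underscore before the number:

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the column layout of train.csv.
# The values are random and only serve to illustrate the selection.
groups = ['margin', 'shape', 'texture']
columns = ['id', 'species'] + [f'{g}{i}' for g in groups for i in range(1, 65)]
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(3, len(columns)), columns=columns)

# Select each 64-attribute group by its column-name prefix
# (matches both "margin1" and "margin_1" style names).
feature_blocks = {g: toy.filter(regex=rf'^{g}_?\d+$') for g in groups}
block_shapes = {g: block.shape for g, block in feature_blocks.items()}
print(block_shapes)
```

Splitting the columns this way makes it easy to inspect or scale one descriptor at a time, although the model in this notebook uses all 192 features together.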

## Evaluation

Submissions are evaluated using the multi-class logarithmic loss. Each image has been labeled with one true species. For each image, you must submit a set of predicted probabilities (one for every species). 
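
As a rough illustration of the metric, the multi-class log loss is the mean negative log of the probability assigned to the true class. The sketch below is a minimal NumPy version; the function name `multiclass_log_loss`, the row renormalization, and the clipping bound `eps=1e-15` are assumptions made for the example, not taken from this notebook:

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: integer class labels, shape (n_samples,)
    proba:  predicted probabilities, shape (n_samples, n_classes)
    eps:    clipping bound that keeps log() finite (assumed value)
    """
    proba = np.asarray(proba, dtype=float)
    # Renormalize each row so the probabilities sum to one
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Clip to avoid log(0)
    proba = np.clip(proba, eps, 1 - eps)
    # Pick the probability predicted for the true class of each sample
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_proba))

# Three samples, three classes
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.6, 0.2]])
loss = multiclass_log_loss(y_true, proba)
print(loss)
```

Lower is better. scikit-learn's `neg_log_loss` scorer reports the negative of this quantity so that, like other scorers, higher values mean a better model.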

#### The dataset is from Kaggle: https://www.kaggle.com/c/leaf-classification/data

In [1]:
import pandas as pd

train_data=pd.read_csv('train.csv')
test_data=pd.read_csv('test.csv')

train_data[:3]

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293


In [3]:
import numpy as np
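from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Standardize the features. The scaler is fit on the training set only and
# then reused for the test set, so both are scaled with the same statistics.
scaler = StandardScaler()
X_train = scaler.fit_transform(train_data.drop(['id', 'species'], axis=1))
X_test = scaler.transform(test_data.drop(['id'], axis=1))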

encoder=LabelEncoder().fit(train_data["species"])
y=encoder.transform( train_data["species"] )

params = {'C':[100, 500, 1500, 3000]}
log_reg = LogisticRegression()
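# return_train_score=True is needed for 'mean_train_score' to appear in cv_results_
clf = GridSearchCV(log_reg, params, scoring='neg_log_loss', n_jobs=-1, cv=5,
                   return_train_score=True)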
model= clf.fit(X_train, y)

print(model.cv_results_['mean_train_score'])
print(model.cv_results_['mean_test_score'])

[-0.01639698 -0.00399972 -0.00151607 -0.00080714]
[-0.16255468 -0.14150859 -0.13950822 -0.14104502]
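
Because `neg_log_loss` is the negated log loss, the score closest to zero is the best one. A small sketch that picks the best `C` from the cross-validation scores printed above (the numbers are copied from that output; the variable names are illustrative):

```python
import numpy as np

# Values copied from the grid search output above
C_values = [100, 500, 1500, 3000]
mean_test_score = np.array([-0.16255468, -0.14150859, -0.13950822, -0.14104502])

# neg_log_loss is "higher is better", so the best candidate is the argmax
best_index = int(np.argmax(mean_test_score))
best_C = C_values[best_index]
best_log_loss = -mean_test_score[best_index]
print(best_C, best_log_loss)
```

On the fitted search object the same information is available as `model.best_params_` and `model.best_score_`. The training score keeps improving as `C` grows while the validation score is slightly worse at `C=3000` than at `C=1500`, which suggests that weaker regularization beyond that point starts to overfit.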


In [4]:
pred=clf.predict_proba(X_test)
predictions=pd.DataFrame(pred, columns=encoder.classes_, index=test_data['id'])

predictions.to_csv(path_or_buf='classifications.csv', header=True, 
                   index=True, index_label='id')
print('Done.')

Done.
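
Before submitting, it can be worth checking that the probability columns line up with the species names and that each row is a valid distribution. The sketch below runs on toy stand-in data (the random features and the names `toy_clf` and `toy_predictions` are hypothetical), not on the competition files:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy stand-in data: three species, two random features
rng = np.random.RandomState(0)
species = np.array(['Acer_Opalus', 'Pterocarya_Stenoptera',
                    'Quercus_Hartwissiana'] * 10)
X = rng.rand(30, 2)

encoder = LabelEncoder().fit(species)
y = encoder.transform(species)
toy_clf = LogisticRegression().fit(X, y)

# Build a small submission-style frame, one probability column per species
proba = toy_clf.predict_proba(X[:5])
toy_predictions = pd.DataFrame(proba, columns=encoder.classes_,
                               index=pd.Index(range(1, 6), name='id'))

# predict_proba orders its columns by toy_clf.classes_ (the encoded integers),
# which should map back to the species names in encoder.classes_
columns_aligned = (list(encoder.inverse_transform(toy_clf.classes_))
                   == list(toy_predictions.columns))
# Each row should sum to one
rows_sum_to_one = np.allclose(toy_predictions.sum(axis=1), 1.0)
print(columns_aligned, rows_sum_to_one)
```

The same two checks could be applied to the `predictions` frame before it is written to `classifications.csv`.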
