# Exploratory Analysis - Stage 1

As a first step toward solving the leaf classification problem, we should get a general understanding of the data. Let's dive in!

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_train_dataset = pd.read_csv('../data/train.csv')
df_test_dataset = pd.read_csv('../data/test.csv')

In [3]:
df_train_dataset.head()

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293
3,5,Tilia_Tomentosa,0.0,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,...,0.0,0.000977,0.0,0.0,0.020508,0.0,0.0,0.017578,0.0,0.047852
4,6,Quercus_Variabilis,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,...,0.09668,0.0,0.021484,0.0,0.0,0.0,0.0,0.0,0.0,0.03125


In [4]:
print(df_train_dataset.shape)
print(df_test_dataset.shape)

(990, 194)
(594, 193)


So our training data has 990 rows and 194 columns, and the testing data has 594 rows and 193 columns. That is 1584 samples in total, or 16 samples for each of the 99 classes, so I assume 10 samples of each species are in the training set and 6 are in the testing set. I also assume the column missing from the test dataset is `species`. Let's verify these assumptions!

In [5]:
df_train_dataset.groupby('species').count()['id'].head()
# head() only trims the printed output; the full result had a count of 10 for every species

species
Acer_Capillipes    10
Acer_Circinatum    10
Acer_Mono          10
Acer_Opalus        10
Acer_Palmatum      10
Name: id, dtype: int64
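
The `head()` call above prints only the first five species. A minimal sketch of checking every class in one go, assuming a small made-up frame (`toy`, with three invented species) in place of the real training data:

```python
import pandas as pd

# Toy stand-in for the training frame: 3 made-up species, 10 rows each.
toy = pd.DataFrame({
    "id": range(1, 31),
    "species": ["A", "B", "C"] * 10,
})

# value_counts() returns one count per distinct species.
counts = toy["species"].value_counts()

# True only if every species appears exactly 10 times.
all_balanced = bool((counts == 10).all())
n_classes = counts.size
print(n_classes, all_balanced)
```

On the real data the same two values would be expected to come out as 99 and `True` if the assumption holds.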

In [6]:
print('species' in df_test_dataset.columns)

False
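
This confirms that `species` is absent from the test set, but not that it is the only difference between the two files. A sketch of comparing the column sets directly, using made-up two-column frames (`toy_train` and `toy_test` are illustrative names, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for the two frames.
toy_train = pd.DataFrame({"id": [1, 2], "species": ["A", "B"], "margin1": [0.1, 0.2]})
toy_test = pd.DataFrame({"id": [3], "margin1": [0.3]})

# Index.difference returns the labels present in one Index but not the other.
only_in_train = toy_train.columns.difference(toy_test.columns)
only_in_test = toy_test.columns.difference(toy_train.columns)
print(list(only_in_train), list(only_in_test))
```

If the assumption is right, the first list holds only `species` and the second is empty.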


In [7]:
# The ids of both files together should be exactly 1..1584, with no gaps or duplicates
sorted_ids = sorted(list(df_train_dataset['id']) + list(df_test_dataset['id']))

expected_ids = list(range(1, 1585))

print(expected_ids == sorted_ids)

True


What kind of values are stored in the pandas dataframes? Let's check the features:

In [8]:
df_train_dataset.describe()

Unnamed: 0,id,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
count,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,...,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0
mean,799.59596,0.017412,0.028539,0.031988,0.02328,0.014264,0.038579,0.019202,0.001083,0.007167,...,0.036501,0.005024,0.015944,0.011586,0.016108,0.014017,0.002688,0.020291,0.008989,0.01942
std,452.477568,0.019739,0.038855,0.025847,0.028411,0.01839,0.05203,0.017511,0.002743,0.008933,...,0.063403,0.019321,0.023214,0.02504,0.015335,0.060151,0.011415,0.03904,0.013791,0.022768
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,415.25,0.001953,0.001953,0.013672,0.005859,0.001953,0.0,0.005859,0.0,0.001953,...,0.0,0.0,0.000977,0.0,0.004883,0.0,0.0,0.0,0.0,0.000977
50%,802.5,0.009766,0.011719,0.025391,0.013672,0.007812,0.015625,0.015625,0.0,0.005859,...,0.004883,0.0,0.005859,0.000977,0.012695,0.0,0.0,0.003906,0.00293,0.011719
75%,1195.5,0.025391,0.041016,0.044922,0.029297,0.017578,0.056153,0.029297,0.0,0.007812,...,0.043701,0.0,0.022217,0.009766,0.021484,0.0,0.0,0.023438,0.012695,0.029297
max,1584.0,0.087891,0.20508,0.15625,0.16992,0.11133,0.31055,0.091797,0.03125,0.076172,...,0.42969,0.20215,0.17285,0.2002,0.10645,0.57813,0.15137,0.37598,0.086914,0.1416


In [9]:
cols = df_train_dataset.columns[2:]

print("Examination of minimum, mean, and maximum values for each column:")
print("\ncolumn  \tminimum \tmean    \tmaximum")

for col in cols:
    print(col, '\t', df_train_dataset[col].min(), '\t', df_train_dataset[col].mean(), '\t', df_train_dataset[col].max())

Examination of minimum, mean, and maximum values for each column:

column  	minimum 	mean    	maximum
margin1 	 0.0 	 0.0174123585858586 	 0.087891
margin2 	 0.0 	 0.028539285858585922 	 0.20508
margin3 	 0.0 	 0.031987842424242394 	 0.15625
margin4 	 0.0 	 0.023279618181818222 	 0.16992
margin5 	 0.0 	 0.014263668686868708 	 0.11133
margin6 	 0.0 	 0.038579214141414214 	 0.31055
margin7 	 0.0 	 0.01920174343434347 	 0.091797
margin8 	 0.0 	 0.0010830373737373711 	 0.03125
margin9 	 0.0 	 0.007167232323232346 	 0.076172
margin10 	 0.0 	 0.018639494949494954 	 0.097656
margin11 	 0.0 	 0.024208851515151553 	 0.125
margin12 	 0.0 	 0.011975161616161667 	 0.052734
margin13 	 0.0 	 0.041252416161616234 	 0.38867
margin14 	 0.0 	 0.008053108080808115 	 0.082031
margin15 	 0.0 	 0.015609166666666719 	 0.064453
margin16 	 0.0 	 0.00011047676767676767 	 0.015625
margin17 	 0.0 	 0.015127809090909093 	 0.060547
margin18 	 0.0 	 0.02010729191919196 	 0.22852
margin19 	 0.0 	 0.012344073737373756
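
The same minimum/mean/maximum summary can also be built as a single table instead of a printed loop. A sketch with `DataFrame.agg`, assuming a tiny made-up frame (`toy`) that mimics the layout of `id`, `species`, then feature columns:

```python
import pandas as pd

# Toy stand-in: two identifier columns followed by two feature columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "species": ["A", "B", "C"],
    "margin1": [0.0, 0.5, 1.0],
    "margin2": [0.0, 0.25, 0.5],
})

feature_cols = toy.columns[2:]

# agg() with a list of names gives one row per statistic;
# transposing puts one feature per row, like the printed listing.
summary = toy[feature_cols].agg(["min", "mean", "max"]).T
print(summary)
```

The result is a regular dataframe, so it can be sorted or filtered (for example, to find the features with the largest maximum).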

In [10]:
margin_mean = 0
shape_mean = 0
texture_mean = 0

for i in range(64):
    margin_mean += df_train_dataset['margin'+str(i+1)].mean()
    shape_mean += df_train_dataset['shape'+str(i+1)].mean()
    texture_mean += df_train_dataset['texture'+str(i+1)].mean()
    
print(margin_mean/64)
print(shape_mean/64)
print(texture_mean/64)

0.015624954640151538
0.0006066120253945707
0.01562502228535353


Some classifiers perform poorly when features are on different scales. Let's rescale every feature to the range [0, 1], using each dataset's own minimum and maximum:

In [11]:
for col in cols:
    # Rescale train dataset
    minimum = df_train_dataset[col].min()
    maximum = df_train_dataset[col].max()
    df_train_dataset[col] = (df_train_dataset[col] - minimum) / (maximum - minimum)
    
    # Rescale test dataset
    minimum = df_test_dataset[col].min()
    maximum = df_test_dataset[col].max()
    df_test_dataset[col] = (df_test_dataset[col] - minimum) / (maximum - minimum)
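
Because the test set is rescaled with its own minimum and maximum, the same raw value can end up with a different scaled value in the two frames. A common alternative is to compute the minimum and maximum on the training set only and reuse them for the test set. The sketch below shows that variant on made-up data; `toy_train`, `toy_test` and the guard against constant columns are assumptions added for illustration, not part of this notebook:

```python
import pandas as pd

# Made-up feature frames standing in for the real train and test data.
toy_train = pd.DataFrame({"margin1": [0.0, 0.5, 1.0], "margin2": [2.0, 4.0, 6.0]})
toy_test = pd.DataFrame({"margin1": [0.25, 1.5], "margin2": [3.0, 6.0]})

# Statistics come from the training set only.
train_min = toy_train.min()
train_max = toy_train.max()

# Guard (an added assumption): a constant column would give a zero span,
# so replace 0 with 1 to avoid dividing by zero.
span = (train_max - train_min).replace(0, 1)

# The same shift and scale are applied to both frames.
scaled_train = (toy_train - train_min) / span
scaled_test = (toy_test - train_min) / span
print(scaled_test)
```

With this variant, test values outside the training range land outside [0, 1] (the 1.5 above stays 1.5), which is the price of keeping the two frames on one scale.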

In [12]:
print("Verification of rescaled features:")
print("\ncolumn  \tminimum \tmean    \tmaximum")

for col in cols[:10]:
    print(col, '\t', df_train_dataset[col].min(), '\t', df_train_dataset[col].mean(), '\t', df_train_dataset[col].max())

Verification of rescaled features:

column  	minimum 	mean    	maximum
margin1 	 0.0 	 0.1981131012943151 	 1.0
margin2 	 0.0 	 0.13916172156517356 	 1.0
margin3 	 0.0 	 0.20472219151515064 	 1.0
margin4 	 0.0 	 0.13700340267077615 	 1.0
margin5 	 0.0 	 0.12812062055931717 	 1.0
margin6 	 0.0 	 0.12422867216684641 	 1.0
margin7 	 0.0 	 0.20917615427893563 	 1.0
margin8 	 0.0 	 0.03465719595959588 	 1.0
margin9 	 0.0 	 0.09409274173229452 	 1.0
margin10 	 0.0 	 0.19086891690725546 	 1.0


Let's now convert the species column of the train dataset to numeric values, using each species' position in the column order of the sample submission:

In [13]:
sample_sub = '../submissions/sample_submission.csv'
df_sample = pd.read_csv(sample_sub)

submission_columns = df_sample.columns
species_names = list(submission_columns[1:])
print(species_names)

df_train_dataset['species_num'] = df_train_dataset['species'].apply(lambda x: species_names.index(x))

['Acer_Capillipes', 'Acer_Circinatum', 'Acer_Mono', 'Acer_Opalus', 'Acer_Palmatum', 'Acer_Pictum', 'Acer_Platanoids', 'Acer_Rubrum', 'Acer_Rufinerve', 'Acer_Saccharinum', 'Alnus_Cordata', 'Alnus_Maximowiczii', 'Alnus_Rubra', 'Alnus_Sieboldiana', 'Alnus_Viridis', 'Arundinaria_Simonii', 'Betula_Austrosinensis', 'Betula_Pendula', 'Callicarpa_Bodinieri', 'Castanea_Sativa', 'Celtis_Koraiensis', 'Cercis_Siliquastrum', 'Cornus_Chinensis', 'Cornus_Controversa', 'Cornus_Macrophylla', 'Cotinus_Coggygria', 'Crataegus_Monogyna', 'Cytisus_Battandieri', 'Eucalyptus_Glaucescens', 'Eucalyptus_Neglecta', 'Eucalyptus_Urnigera', 'Fagus_Sylvatica', 'Ginkgo_Biloba', 'Ilex_Aquifolium', 'Ilex_Cornuta', 'Liquidambar_Styraciflua', 'Liriodendron_Tulipifera', 'Lithocarpus_Cleistocarpus', 'Lithocarpus_Edulis', 'Magnolia_Heptapeta', 'Magnolia_Salicifolia', 'Morus_Nigra', 'Olea_Europaea', 'Phildelphus', 'Populus_Adenopoda', 'Populus_Grandidentata', 'Populus_Nigra', 'Prunus_Avium', 'Prunus_X_Shmittii', 'Pterocarya_S
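
`species_names.index(x)` scans the list on every call, which is fine for 990 rows. An equivalent approach is a dictionary lookup through `Series.map`, sketched here with made-up names (`names` and `toy` are hypothetical stand-ins for the submission columns and the training frame):

```python
import pandas as pd

# Hypothetical class names, in the order the submission columns would have.
names = ["A", "B", "C"]
toy = pd.DataFrame({"species": ["C", "A", "B", "A"]})

# Build the name -> position lookup once.
name_to_num = {name: i for i, name in enumerate(names)}
toy["species_num"] = toy["species"].map(name_to_num)

# map() yields NaN for a name missing from the dict, whereas list.index
# raises ValueError, so check explicitly that nothing was left unmapped.
assert toy["species_num"].notna().all()
print(toy["species_num"].tolist())
```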

If the transformation worked, there should be 10 of each class in `species_num`:

In [14]:
for i in range(99):
    assert len(df_train_dataset[df_train_dataset['species_num'] == i].index) == 10

In [15]:
if 'species' in df_train_dataset.columns:
    df_train_dataset = df_train_dataset.drop(['species'], axis=1)

Now that we have transformed data, let's see how to format a submission:

In [16]:
df_sample.head()

Unnamed: 0,id,Acer_Capillipes,Acer_Circinatum,Acer_Mono,Acer_Opalus,Acer_Palmatum,Acer_Pictum,Acer_Platanoids,Acer_Rubrum,Acer_Rufinerve,...,Salix_Fragilis,Salix_Intergra,Sorbus_Aria,Tilia_Oliveri,Tilia_Platyphyllos,Tilia_Tomentosa,Ulmus_Bergmanniana,Viburnum_Tinus,Viburnum_x_Rhytidophylloides,Zelkova_Serrata
0,4,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,...,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101
1,7,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,...,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101
2,9,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,...,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101
3,12,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,...,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101
4,13,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,...,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101,0.010101


In [17]:
print(df_sample.shape)

(594, 100)
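
The shape matches 594 test ids by one `id` column plus 99 species columns, and every value shown in the sample is 0.010101, i.e. a uniform 1/99. A sketch of assembling a frame in the same layout from a matrix of probabilities; the ids, class names and probabilities below are made up, and no real model is involved:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: 3 classes and 3 test ids instead of 99 and 594.
class_names = ["A", "B", "C"]
test_ids = [4, 7, 9]

# Uniform probabilities, one row per test id, mirroring the sample file.
probs = np.full((len(test_ids), len(class_names)), 1.0 / len(class_names))

# One column per class, then the id column placed first.
submission = pd.DataFrame(probs, columns=class_names)
submission.insert(0, "id", test_ids)
print(submission)
```

A real submission would swap `probs` for a classifier's predicted probabilities, with the columns in the same order as the sample file.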


It appears that for each test id we have to give, for every one of the 99 species, the probability that the leaf belongs to that species. Let's pickle the transformed datasets for later use:

In [18]:
train_pickle_file = '../data/pickles/train_data.pkl'
test_pickle_file = '../data/pickles/test_data.pkl'

df_train_dataset.to_pickle(train_pickle_file)
df_test_dataset.to_pickle(test_pickle_file)
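
A later notebook can load these files back with `pd.read_pickle`. A minimal round-trip sketch, assuming a made-up frame and a temporary directory rather than the `../data/pickles/` paths above:

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the transformed training data.
toy = pd.DataFrame({"id": [1, 2], "margin1": [0.0, 1.0], "species_num": [3, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)              # write, as in the cell above
    restored = pd.read_pickle(path)  # read back into a new dataframe

# equals() compares values, dtypes and labels.
same = restored.equals(toy)
print(same)
```

Note that `to_pickle` does not create missing directories, so the target folder has to exist before the cell above is run.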