# How many clusters of grain?

This exercise is adapted from https://github.com/benjaminwilson/python-clustering-exercises

This exercise is about choosing a good number of clusters for a dataset using the k-means inertia graph.  You are given a dataset of measurements of grain samples.  What is a good number of clusters in this case?

This dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/seeds).
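
The inertia mentioned above is the sum of squared distances from each sample to its nearest cluster center. As a rough illustration only, the sketch below runs a basic Lloyd's k-means written in NumPy (not the scikit-learn implementation) on hypothetical toy data and collects the inertia for k = 1 to 6. The function name `kmeans_inertia` and the toy blobs are assumptions made for this example.

```python
import numpy as np

def kmeans_inertia(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the final inertia."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k)
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Hypothetical toy data: three 2D blobs of 50 points each
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

inertias = [kmeans_inertia(toy, k) for k in range(1, 7)]
for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting these values against k gives the inertia graph; a common heuristic is to pick the k where the curve stops dropping sharply (the "elbow").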


**Step 1:** Load the dataset _(written for you)_.

In [13]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

In [3]:
seeds_df = pd.read_csv('Data/seeds.csv')
# forget about the grain variety for the moment - we'll use this later

**Step 2:** Display the DataFrame to inspect the data.  Notice that there are 7 measurement columns (plus the `grain_variety` label), so each grain sample (row) is a point in 7D space!  Scatter plots can't help us here.

In [4]:
seeds_df.head(10)

| | area | perimeter | compactness | length | width | asymmetry_coefficient | groove_length | grain_variety |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.22 | Kama wheat |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | Kama wheat |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 | Kama wheat |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | Kama wheat |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | Kama wheat |
| 5 | 14.38 | 14.21 | 0.8951 | 5.386 | 3.312 | 2.462 | 4.956 | Kama wheat |
| 6 | 14.69 | 14.49 | 0.8799 | 5.563 | 3.259 | 3.586 | 5.219 | Kama wheat |
| 7 | 14.11 | 14.1 | 0.8911 | 5.42 | 3.302 | 2.7 | 5.0 | Kama wheat |
| 8 | 16.63 | 15.46 | 0.8747 | 6.053 | 3.465 | 2.04 | 5.877 | Kama wheat |
| 9 | 16.44 | 15.25 | 0.888 | 5.884 | 3.505 | 1.969 | 5.533 | Kama wheat |


**Step 3:** Extract the measurements from the DataFrame using its `.values` attribute:

In [5]:
dataset = seeds_df.values

In [6]:
dataset

array([[15.26, 14.84, 0.871, ..., 2.221, 5.22, 'Kama wheat'],
       [14.88, 14.57, 0.8811, ..., 1.018, 4.956, 'Kama wheat'],
       [14.29, 14.09, 0.905, ..., 2.699, 4.825, 'Kama wheat'],
       ...,
       [13.2, 13.66, 0.8883, ..., 8.315, 5.056, 'Canadian wheat'],
       [11.84, 13.21, 0.8521, ..., 3.5980000000000003, 5.044,
        'Canadian wheat'],
       [12.3, 13.34, 0.8684, ..., 5.6370000000000005, 5.063,
        'Canadian wheat']], dtype=object)

In [11]:
dataset.shape

(210, 8)

In [7]:
# Features X: every column except the last one (the target), converted to float
X = dataset[:,0:7].astype(float)

In [8]:
# Target Y: the last column only (the grain variety)
Y = dataset[:,7]
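
Because the DataFrame mixes numbers and strings, `.values` returns an array with `dtype=object`, which is why the feature slice is converted with `.astype(float)`. The sketch below shows the same slicing on a hypothetical two-row, three-column stand-in; the array `mixed` is an assumption made for this example, not the real data.

```python
import numpy as np

# Hypothetical two-row stand-in for the mixed-type array shown above
mixed = np.array([[15.26, 14.84, 'Kama wheat'],
                  [12.30, 13.34, 'Canadian wheat']], dtype=object)

features = mixed[:, 0:2].astype(float)  # numeric columns, converted to float64
labels = mixed[:, 2]                    # last column, still Python strings

print(features.dtype)
print(labels.dtype)
```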

In [10]:
print(X)
print(Y)

[[15.26   14.84    0.871  ...  3.312   2.221   5.22  ]
 [14.88   14.57    0.8811 ...  3.333   1.018   4.956 ]
 [14.29   14.09    0.905  ...  3.337   2.699   4.825 ]
 ...
 [13.2    13.66    0.8883 ...  3.232   8.315   5.056 ]
 [11.84   13.21    0.8521 ...  2.836   3.598   5.044 ]
 [12.3    13.34    0.8684 ...  2.974   5.637   5.063 ]]
['Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat' 'Kama wheat'
 'Ka

### Encode the target variable

In [14]:
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# LabelEncoder assigns the integer codes in sorted (alphabetical) order of the class names

In [15]:
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
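
To make the two encoding steps concrete, here is a NumPy-only sketch that mimics them: `np.unique` with `return_inverse=True` yields integer codes in sorted order of the class names, and indexing an identity matrix turns those codes into one-hot rows. This is an illustration, not what `LabelEncoder` or `to_categorical` do internally, and the four sample labels (including 'Rosa wheat') are assumed for the example.

```python
import numpy as np

# Hypothetical sample of labels; 'Rosa wheat' is assumed as the third variety
names = np.array(['Kama wheat', 'Rosa wheat', 'Canadian wheat', 'Kama wheat'])

# Integer codes follow the sorted order of the unique class names
classes, codes = np.unique(names, return_inverse=True)

# One-hot encoding: row i of the identity matrix represents class i
one_hot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(one_hot)
```

With three classes, each one-hot row has three entries, which matches the three output units of the network's softmax layer.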

In [None]:
# A correlation heat map is not required for this classification approach
# For reproducible results, fix the random seed before training
# Normalizing the input values usually helps training and can lead to higher accuracy
# epochs is the number of full passes over the training data
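
The notes above mention fixing the seed and normalizing the inputs. A minimal NumPy sketch of both ideas follows; the toy array `raw` and its scales are assumptions for illustration, and the notebook itself trains on the unscaled `X`.

```python
import numpy as np

# Fixed seed, so the toy data is identical on every run
rng = np.random.default_rng(7)

# Hypothetical features on very different scales
raw = np.column_stack([rng.normal(15.0, 3.0, size=100),
                       rng.normal(0.87, 0.02, size=100)])

# Standardize: subtract each column's mean and divide by its standard deviation
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(standardized.mean(axis=0).round(6))
print(standardized.std(axis=0).round(6))
```

When scaling is combined with cross-validation, the mean and standard deviation should come from the training fold only; wrapping a scaler and the estimator in a scikit-learn `Pipeline` is one common way to arrange that.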

### Define the Model

In [17]:
# define baseline model
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(8, input_dim=7, activation='relu')) # hidden layer: 7 inputs -> 8 units, ReLU activation
	model.add(Dense(3, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model
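
To show what this architecture computes, the sketch below performs one forward pass of a 7-8-3 network in plain NumPy, followed by the categorical cross-entropy loss. The weights are random and untrained, and the input batch is made up; it illustrates shapes and formulas only, not Keras internals.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Randomly initialised weights, for shape illustration only (not trained values)
W1, b1 = rng.normal(scale=0.1, size=(7, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 3)), np.zeros(3)

x = rng.normal(size=(5, 7))         # hypothetical batch: 5 samples, 7 features
hidden = relu(x @ W1 + b1)          # shape (5, 8)
probs = softmax(hidden @ W2 + b2)   # shape (5, 3); each row sums to 1

# Categorical cross-entropy against one-hot targets
targets = np.eye(3)[[0, 1, 2, 0, 1]]
loss = -np.mean(np.sum(targets * np.log(probs), axis=1))
print(probs.shape, loss)
```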

### Evaluate the model using K-Fold

In [18]:
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)


In [19]:
kfold = KFold(n_splits=10, shuffle=True)

In [20]:
results = cross_val_score(estimator, X, dummy_y, cv=kfold)

In [21]:
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 90.95% (3.33%)
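
As a sketch of what a shuffled 10-fold split does with 210 samples, the NumPy code below builds the folds by hand. It mirrors the idea behind `KFold(n_splits=10, shuffle=True)` rather than its exact implementation, and the seed is an assumption added so the example is repeatable.

```python
import numpy as np

n_samples, n_splits = 210, 10
rng = np.random.default_rng(0)               # fixed seed (assumed) for repeatability
indices = rng.permutation(n_samples)         # corresponds to shuffle=True
folds = np.array_split(indices, n_splits)    # 10 folds of 21 indices each

fold_sizes = []
for i, test_idx in enumerate(folds):
    # Every fold serves once as the test set; the other nine form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes[0])
```

With 21 test samples per fold, a single misclassified sample changes that fold's accuracy by 1/21, about 4.76 percentage points, which helps explain why the standard deviation across folds is several percent.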


In [24]:
# define a new, wider model to try to increase the accuracy
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(16, input_dim=7, activation='relu')) #double the hidden nodes
	model.add(Dense(3, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

In [25]:
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))


Baseline: 91.43% (4.67%)
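
A quick back-of-the-envelope comparison of the two models, using only numbers reported in this notebook. The helper `dense_params` is a hypothetical name introduced here; it counts the weights and biases of a fully connected layer.

```python
def dense_params(n_in, n_out):
    # Each output unit has n_in weights plus one bias
    return n_in * n_out + n_out

small = dense_params(7, 8) + dense_params(8, 3)     # 8 hidden units
large = dense_params(7, 16) + dense_params(16, 3)   # 16 hidden units

# Mean accuracies reported above, in percent
gap = 91.43 - 90.95

print(small, large, round(gap, 2))
```

The gap between the two mean accuracies is well below either reported standard deviation (3.33% and 4.67%), and the folds are reshuffled without a fixed seed on each run, so these two runs alone may not show a real improvement from the wider layer.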
