# KMeans

I'm going to use the well-known UCI abalone dataset to predict the age of abalone from physical measurements.  The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.  Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

The variables are described below:

    Name            Data Type   Meas.   Description
    ----            ---------   -----   -----------
    Sex                nominal          M, F, and I (infant)
    Length          continuous     mm   Longest shell measurement
    Diameter        continuous     mm   perpendicular to length
    Height          continuous     mm   with meat in shell
    Whole weight    continuous   grams  whole abalone
    Shucked weight  continuous   grams  weight of meat
    Viscera weight  continuous   grams  gut weight (after bleeding)
    Shell weight    continuous   grams  after being dried
    Rings              integer          +1.5 gives the age in years
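
As a small illustration of the last row of the table, the age can be derived from the ring count by adding 1.5. The sketch below assumes pandas is available; the column names come from the variable description (the raw file has no header), and the two rows are copied from the `df.head(10)` output in this notebook as a stand-in for the full file.

```python
import pandas as pd

# Column names taken from the variable description; the raw file has no header row.
COLUMNS = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Two rows copied from the dataset preview, used here instead of downloading the file.
sample = pd.DataFrame(
    [["M", 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15, 15],
     ["M", 0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07, 7]],
    columns=COLUMNS,
)

# Age in years = number of rings + 1.5
sample["Age"] = sample["Rings"] + 1.5
print(sample[["Rings", "Age"]])
```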

In [80]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# The raw file has no header row, so columns are numbered 0 to 8
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', sep=',', header=None)

In [81]:
df.head(10)

,0,1,2,3,4,5,6,7,8
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


In [82]:
# Encode the Sex variable as integer codes (M=0, F=1, I=2, in order of first appearance)
df[0] = pd.factorize(df[0])[0]

df.head(10)

,0,1,2,3,4,5,6,7,8
0,0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,2,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,1,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,1,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,0,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,1,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19
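
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize` returns, using a short made-up series of sex codes rather than the real column. Integer codes imply an ordering (M < F < I) that the category does not have, so the sketch also shows `pd.get_dummies` as one possible alternative; the notebook itself uses only `pd.factorize`.

```python
import pandas as pd

# Hypothetical short sample of the Sex column
sex = pd.Series(["M", "M", "F", "M", "I"])

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(sex)
print(list(codes))    # integer code for each row
print(list(uniques))  # category that each code stands for

# Alternative (not used in this notebook): one indicator column per category
one_hot = pd.get_dummies(sex)
print(one_hot)
```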


In [83]:
# Train-Test split
X = np.asarray(df[df.columns[:-1]])
Y = np.asarray(df[df.columns[8]])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.75)
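
The split above does not fix `random_state`, so every call to `train_test_split` draws a different partition, which is why the support counts differ between the classification reports in this notebook. A minimal sketch of a reproducible split follows; it uses synthetic data (`X_demo`, `Y_demo` are made-up names and values), not the abalone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 8 features, integer targets between 3 and 20
rng = np.random.default_rng(0)
X_demo = rng.random((100, 8))
Y_demo = rng.integers(3, 21, size=100)

# With a fixed random_state, two calls return exactly the same partition
split_a = train_test_split(X_demo, Y_demo, train_size=0.75, random_state=0)
split_b = train_test_split(X_demo, Y_demo, train_size=0.75, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)
```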

## SVM

In [84]:
# Create an SVC classifier with an RBF kernel
clf = SVC(kernel='rbf', random_state=0, gamma=1, C=1)
# Train the classifier and report per-class metrics on the test set
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
print(classification_report(Y_test, pred))

             precision    recall  f1-score   support

          3       0.00      0.00      0.00         5
          4       0.00      0.00      0.00        10
          5       0.38      0.64      0.48        25
          6       0.51      0.24      0.33        74
          7       0.35      0.35      0.35        95
          8       0.34      0.28      0.30       133
          9       0.21      0.56      0.30       178
         10       0.25      0.25      0.25       189
         11       0.37      0.27      0.32       124
         12       0.00      0.00      0.00        66
         13       0.00      0.00      0.00        47
         14       0.00      0.00      0.00        25
         15       0.00      0.00      0.00        18
         16       0.00      0.00      0.00        15
         17       0.00      0.00      0.00        13
         18       0.00      0.00      0.00         6
         19       0.00      0.00      0.00         9
         20       0.00      0.00      0.00   

## KMeans

In [85]:
# Cluster the feature matrix X into 10 groups
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

label = kmeans.labels_

In [86]:
# Move the ring count to column 9 and store the cluster label in column 8,
# so the label can be used as an extra feature for a second SVM
df[9] = df[8]
df[8] = label
df.head(5)

,0,1,2,3,4,5,6,7,8,9
0,0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,6,15
1,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,6,7
2,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,5,9
3,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,6,10
4,2,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,8,7
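
The notebook fits KMeans on all rows before splitting. As a hedged variant, the sketch below fits the clusters on the training rows only and assigns test rows with `predict`, so the test set does not influence the cluster centres. It runs on synthetic data; all names ending in `_demo` or `_d` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 feature columns
rng = np.random.default_rng(0)
X_demo = rng.random((200, 8))

X_train_d, X_test_d = train_test_split(X_demo, train_size=0.75, random_state=0)

# Fit the clusters on the training rows only
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_train_d)

# Append the cluster id as a ninth feature column
X_train_aug = np.column_stack([X_train_d, km.labels_])
X_test_aug = np.column_stack([X_test_d, km.predict(X_test_d)])

print(X_train_aug.shape, X_test_aug.shape)
```

The cluster id is a nominal label, so its numeric value carries no order; an RBF kernel will still treat it as a distance along one axis.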


In [87]:
# Train-Test split
X = np.asarray(df[df.columns[:-1]])
Y = np.asarray(df[df.columns[9]])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.75)

In [88]:
# Create an SVC classifier with an RBF kernel
clf = SVC(kernel='rbf', random_state=0, gamma=1, C=1)
# Train the classifier and report per-class metrics on the test set
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
print(classification_report(Y_test, pred))

             precision    recall  f1-score   support

          3       0.00      0.00      0.00         6
          4       0.00      0.00      0.00        18
          5       0.23      0.28      0.25        25
          6       0.34      0.46      0.39        63
          7       0.30      0.24      0.27        90
          8       0.32      0.26      0.29       137
          9       0.21      0.40      0.28       183
         10       0.21      0.46      0.29       144
         11       0.35      0.20      0.25       143
         12       0.67      0.03      0.06        67
         13       0.00      0.00      0.00        43
         14       0.00      0.00      0.00        32
         15       0.00      0.00      0.00        32
         16       0.00      0.00      0.00        14
         17       0.00      0.00      0.00        15
         18       0.00      0.00      0.00        10
         19       0.00      0.00      0.00         6
         20       0.00      0.00      0.00   

In [89]:
# Repeat with 20 clusters.
# Note: X still includes the 10-cluster label column from the previous step.
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
label = kmeans.labels_
df[8] = label
df.head(5)
# Train-Test split
X = np.asarray(df[df.columns[:-1]])
Y = np.asarray(df[df.columns[9]])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.75)
# Create an SVC classifier with an RBF kernel
clf = SVC(kernel='rbf', random_state=0, gamma=1, C=1)
# Train the classifier and report per-class metrics on the test set
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
print(classification_report(Y_test, pred))

             precision    recall  f1-score   support

          3       0.00      0.00      0.00         4
          4       0.00      0.00      0.00        17
          5       0.28      0.56      0.37        25
          6       0.00      0.00      0.00        79
          7       0.35      0.53      0.42        93
          8       0.37      0.22      0.27       137
          9       0.19      0.46      0.27       168
         10       0.24      0.34      0.28       163
         11       0.18      0.21      0.19       111
         12       0.00      0.00      0.00        73
         13       0.00      0.00      0.00        51
         14       0.00      0.00      0.00        33
         15       0.00      0.00      0.00        27
         16       0.00      0.00      0.00        13
         17       0.00      0.00      0.00        17
         18       0.00      0.00      0.00        11
         19       0.00      0.00      0.00         8
         20       0.00      0.00      0.00   

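Precision did not improve with either number of clusters (10 or 20). Because `train_test_split` is called without a fixed `random_state`, each report is computed on a different random split, so small differences between the reports should be read with caution.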
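
Comparing three per-class tables by eye is hard, so one option is to reduce each run to a single averaged precision. The sketch below shows how macro and weighted precision are computed on a tiny made-up example; the labels are invented for illustration and are not results from this notebook. A class that is never predicted gets precision 0 (with `zero_division=0`), which is what the many 0.00 rows in the reports above reflect.

```python
from sklearn.metrics import precision_score

# Made-up ring counts and predictions
y_true_demo = [9, 9, 10, 10, 11]
y_pred_demo = [9, 10, 10, 10, 9]

# Per-class precision here:
#   class 9:  predicted twice, correct once        -> 1/2
#   class 10: predicted three times, correct twice -> 2/3
#   class 11: never predicted                      -> 0 (via zero_division=0)
macro = precision_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
weighted = precision_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

print(f"macro precision:    {macro:.3f}")     # unweighted mean over classes
print(f"weighted precision: {weighted:.3f}")  # mean weighted by class support
```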