This program performs a clustering analysis of 80 golf courses, grouping them using basic metrics (e.g., the sum of distances from tee to hole) and more advanced metrics (e.g., putting difficulty as measured by "Strokes Gained").

In [14]:
import numpy, pandas, scipy, sklearn 
from sklearn.preprocessing import StandardScaler as ss
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.decomposition import PCA

Let's load some data!

In [15]:
df = pandas.read_csv(r'C:\Users\lomba\Downloads\dg_course_table - dg_course_table.csv')

We'll take the lay of the land and see the table's column names and shape.

In [21]:
print(df.columns)
print(df.shape)
cols = list(df.columns[1:])

Index(['course', 'par', 'yardage', 'yardage_4_5', 'yardage_3',
       'adj_score_to_par', 'adj_par_3_score', 'adj_par_4_score',
       'adj_par_5_score', 'adj_driving_distance', 'adj_sd_distance',
       'adj_driving_accuracy', 'fw_width', 'miss_fw_pen_frac', 'putt_sg',
       'arg_sg', 'app_sg', 'ott_sg', 'adj_gir', 'less_150_sg',
       'greater_150_sg', 'arg_fairway_sg', 'arg_rough_sg', 'arg_bunker_sg',
       'less_5_ft_sg', 'greater_5_less_15_sg', 'greater_15_sg'],
      dtype='object')
(80, 27)


We won't need the course name, so let's drop it from our potential feature set. We won't use this subset of the dataframe in the final program, though. For now we just want a quick scan to confirm that every column can be processed by the clustering algorithm!

In [23]:
features = df.iloc[:, 1:]  # every row, every column except 'course'
print(features.dtypes)
features.head(2)

par                       int64
yardage                   int64
yardage_4_5               int64
yardage_3                 int64
adj_score_to_par        float64
adj_par_3_score         float64
adj_par_4_score         float64
adj_par_5_score         float64
adj_driving_distance    float64
adj_sd_distance         float64
adj_driving_accuracy    float64
fw_width                float64
miss_fw_pen_frac        float64
putt_sg                 float64
arg_sg                  float64
app_sg                  float64
ott_sg                  float64
adj_gir                 float64
less_150_sg             float64
greater_150_sg          float64
arg_fairway_sg          float64
arg_rough_sg            float64
arg_bunker_sg           float64
less_5_ft_sg            float64
greater_5_less_15_sg    float64
greater_15_sg           float64
dtype: object


Unnamed: 0,par,yardage,yardage_4_5,yardage_3,adj_score_to_par,adj_par_3_score,adj_par_4_score,adj_par_5_score,adj_driving_distance,adj_sd_distance,...,ott_sg,adj_gir,less_150_sg,greater_150_sg,arg_fairway_sg,arg_rough_sg,arg_bunker_sg,less_5_ft_sg,greater_5_less_15_sg,greater_15_sg
0,71,7241,463,190,0.68,0.09,0.14,-0.41,285.4,19.9,...,-0.036,0.603,-0.002,0.021,0.021,0.049,-0.063,-0.012,-0.035,-0.016
1,71,6947,444,183,-0.02,0.13,0.04,-0.34,274.6,19.5,...,-0.062,0.6215,0.008,-0.038,0.047,0.08,0.009,-0.001,-0.013,0.001
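
If we'd rather not eyeball the dtypes, the same scan can be done programmatically. Here's a minimal sketch: `check_clusterable` is a hypothetical helper (it isn't part of the notebook), and the toy table holds made-up values rather than the real course data. It flags any column that is non-numeric or has missing values, since KMeans can't handle either.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clusterable(frame):
    """Return the names of columns that are non-numeric or contain missing values."""
    problems = []
    for name in frame.columns:
        column = frame[name]
        if not is_numeric_dtype(column) or column.isna().any():
            problems.append(name)
    return problems


# Toy stand-in for the course table (values are made up for illustration).
toy = pd.DataFrame({
    "course": ["A", "B", "C"],     # text column: not usable as a feature
    "par": [71, 72, 70],           # clean integer column
    "yardage": [7241, 6947, None], # numeric, but with a missing value
})

print(check_clusterable(toy))            # columns that need attention
print(check_clusterable(toy[["par"]]))   # a clean column yields an empty list
```

An empty list means every column is safe to scale and cluster.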


Alright! All floats and integers - we're good to go! We'll create a pair of nested for loops that scale the features, cycle through candidate feature sets (par paired with each other column in turn) and different numbers of clusters (k = 2 to 4), and print the best-scoring feature set and k.

In [13]:
best_score, best_col, best_k = -1.0, None, None
for i in range(2, len(df.columns)):
    # Scale par (column 1) together with column i. Note that [1, i] picks exactly
    # these two columns; a cumulative slice would be df.iloc[:, 1:i + 1].
    features = ss().fit_transform(df.iloc[:, [1, i]])
    sil_vals = []
    for k in range(2, 5):  # separate name so the outer loop's i is not shadowed
        labels = KMeans(n_clusters=k).fit_predict(features)
        # silhouette_score is the mean of silhouette_samples
        silhouette_val = silhouette_score(features, labels)
        sil_vals.append(silhouette_val)
        print(f"For {k} clusters the average silhouette coefficient is {silhouette_val}")
        # Track the overall winner together with its column and k
        if silhouette_val > best_score:
            best_score, best_col, best_k = silhouette_val, df.columns[i], k
    print(f"The optimal number of clusters is {sil_vals.index(max(sil_vals)) + 2} with an average silhouette coefficient of {max(sil_vals)}")
print(f'''

The most effective feature pair was par with {best_col}.
That pair, when broken down into {best_k} clusters, had an average silhouette coefficient of {best_score}''')

For 2 clusters the average silhouette coefficient is 0.44216626841372014
For 3 clusters the average silhouette coefficient is 0.4360581539023505
For 4 clusters the average silhouette coefficient is 0.4706538873644788
The optimal number of clusters is 4 with an average silhouette coefficient of 0.4706538873644788
For 2 clusters the average silhouette coefficient is 0.4381572804548671
For 3 clusters the average silhouette coefficient is 0.42828611352538404
For 4 clusters the average silhouette coefficient is 0.42260661862390875
The optimal number of clusters is 2 with an average silhouette coefficient of 0.4381572804548671
For 2 clusters the average silhouette coefficient is 0.39753927088636376
For 3 clusters the average silhouette coefficient is 0.39646616634498305
For 4 clusters the average silhouette coefficient is 0.406004363176139
The optimal number of clusters is 4 with an average silhouette coefficient of 0.406004363176139
For 2 clusters the average silhouette coefficient is 0.4381572804548671
For 2 clusters the average silhouette coeffic
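
To get a feel for what the score measures, here's a small sketch on synthetic data (two made-up, well-separated blobs, not the golf table). The `n_init=10` and `random_state=0` arguments are my own additions so the run is repeatable; the loop above doesn't set them. The score ranges from -1 to 1, and values near 1 mean points sit close to their own cluster and far from the others.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: two well-separated blobs in 2-D (illustrative only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(40, 2))
X = StandardScaler().fit_transform(np.vstack([blob_a, blob_b]))

# Score each candidate k the same way the loop above does.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(scores)
print(f"Best k: {best_k}")
```

With only two real groups in the data, forcing a third or fourth cluster splits a tight blob and drags the average down, which is why the highest score marks the most natural k.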

And voila! We've got the best-scoring feature set and k among the combinations we tried. We could graph the search process if we wanted to. Additionally, if we had a little more computing power, we could go through the powerset of features to find the optimal feature selection. However, that process would take my PC almost 15 years to complete!
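
To see why the powerset blows up, here's a quick back-of-the-envelope sketch using only the standard library. `feature_subsets` is a hypothetical helper, and the `max_size` cap is just one assumed way to shrink the search; it isn't something the notebook does. The count takes the 26 candidate features (27 columns minus `course`) and the three values of k we tried.

```python
from itertools import combinations

N_FEATURES = 26         # columns in the table minus 'course'
K_VALUES = range(2, 5)  # the same k = 2, 3, 4 tried in the loop

# Every non-empty subset of the features: 2**n - 1 of them.
n_subsets = 2 ** N_FEATURES - 1
n_fits = n_subsets * len(K_VALUES)
print(f"{n_subsets:,} feature subsets, {n_fits:,} KMeans fits")


def feature_subsets(names, max_size=None):
    """Yield every non-empty subset of names, smallest first, up to max_size."""
    limit = len(names) if max_size is None else max_size
    for size in range(1, limit + 1):
        yield from combinations(names, size)


# Capping the subset size keeps the search small, e.g. singles and pairs only.
small = list(feature_subsets(["par", "yardage", "putt_sg"], max_size=2))
print(small)
```

Each extra feature doubles the number of subsets, so capping the subset size (or searching greedily) is the usual way to keep the runtime sane.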