## K-Means Clustering
The goal is to:
* use the existing `df_rescaled_pitch_info.csv`, which contains standardized information on each type of pitch that a pitcher throws
* use all 217 features (this will probably need to be configurable to build a decent model)
* build a file that assigns each pitcher to one cluster of "similar pitchers"
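
One possible way to make the feature set adjustable is to select columns by pitch-type prefix. The sketch below assumes the `<pitch type>_<measurement>` naming seen in this dataset (for example `2f_ax` and `sl_vx0`). `select_features` and its parameters are hypothetical names used for illustration only.

```python
import pandas as pd

def select_features(df, pitch_types=None, always_keep=('p_throws',)):
    """Return only the columns belonging to the requested pitch types.

    pitch_types: iterable of prefixes such as '2f' or 'sl'; None keeps every column.
    always_keep: columns kept regardless of prefix (assumed here: handedness).
    """
    if pitch_types is None:
        return df
    prefixes = tuple(f'{p}_' for p in pitch_types)
    cols = [c for c in df.columns if c in always_keep or c.startswith(prefixes)]
    return df[cols]

# Toy frame that follows the same naming pattern as the real data
toy = pd.DataFrame({
    'p_throws': [1, 0],
    '2f_ax': [0.1, -0.2],
    'sl_vx0': [0.3, 0.4],
})
sliders_only = select_features(toy, pitch_types=['sl'])
print(list(sliders_only.columns))
```

Selecting by prefix keeps the choice of pitch types in one place, so different feature subsets can be tried without editing the training cells.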

## Import all the needed libraries.

In [164]:
!pip install mxnet

import pandas as pd
import numpy as np
from datetime import datetime

import boto3
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac
import mxnet as mx



In [165]:
%%html
<style>
.rendered_html tr, .rendered_html th, .rendered_html td {
  text-align: left;
}
/* .rendered_html :first-child {
  text-align: left;
}
.rendered_html :last-child {
  text-align: left;
} */
</style>

## Loading the data from Amazon S3
Next, let's load the rescaled pitch data that is stored in S3.

In [166]:
bucket = 'appleforge-merlin-develop-datalake'
prefix = 'pitchtype'
data_key = 'df_rescaled_pitch_info.csv'
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)

df = pd.read_csv(data_location, low_memory=False)

In [167]:
df.head()

Unnamed: 0,pitcher,p_throws,2f_ax,2f_ay,2f_az,2f_pct_usage,2f_pfx_x,2f_pfx_z,2f_plate_x,2f_plate_z,...,sl_release_extension,sl_release_pos_x,sl_release_pos_z,sl_release_speed,sl_release_spin_rate,sl_sz_bot,sl_sz_top,sl_vx0,sl_vy0,sl_vz0
0,282332,L,,,,0.0,,,,,...,1.465449,1.646795,0.060431,-1.306652,-0.646412,0.073547,-0.115519,-1.143983,1.268112,0.54595
1,407845,R,-0.545069,-0.007018,0.675023,0.632901,-0.529787,0.626222,-0.13174,0.097922,...,-0.143343,-0.676061,0.603197,0.553478,-0.474859,-0.3387,-0.718141,0.849285,-0.532922,-1.366876
2,424144,L,,,,0.0,,,,,...,-0.825003,1.696063,-1.122078,-1.501336,-0.378488,-0.258455,-0.586218,-1.454694,1.49609,1.513235
3,425772,R,,,,0.0,,,,,...,,,,,,,,,,
4,425794,R,,,,0.0,,,,,...,,,,,,,,,,


In [168]:
df.shape

(830, 218)

## Build Training Data
We're using the standardized pitch data.
* All missing values (NaN) are set to 0.
* Pitcher handedness (`p_throws`) is encoded as R=1, L=0.

In [169]:
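df_data_train = df.copy()
# Encode handedness numerically: right-handed = 1, left-handed = 0
df_data_train['p_throws'] = df_data_train['p_throws'].map({'R': 1, 'L': 0})
# Replace missing values with 0
df_data_train.fillna(0, inplace=True)
# Skip column 0 (the pitcher ID) so that only the 217 feature columns remain
data_train = df_data_train.iloc[:, 1:].values.astype('float32')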
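
The same preprocessing steps can be checked on a tiny frame. This is a sketch, not part of the original notebook; the two rows reuse values from the `df.head()` output above, and `toy` and `features` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pitcher': [282332, 407845],
    'p_throws': ['L', 'R'],
    '2f_ax': [np.nan, -0.545069],
})
toy['p_throws'] = toy['p_throws'].map({'R': 1, 'L': 0})
toy = toy.fillna(0)
# Skip the ID column and convert to the float32 matrix used for training
features = toy.iloc[:, 1:].values.astype('float32')

# Sanity checks: no missing values, ID column excluded, expected dtype
assert not np.isnan(features).any()
assert features.shape == (2, 2)
assert features.dtype == np.float32
print(features)
```

One caveat: `map` turns any `p_throws` value other than `'R'` or `'L'` into NaN, and the following `fillna(0)` would then silently encode it as left-handed. Checking `df['p_throws'].unique()` beforehand would rule that out.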

## Create and train our model
[See the documentation of hyperparameters here](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means-api-config.html)

In [170]:
from sagemaker import KMeans

num_clusters = 10
output_location = 's3://' + bucket + '/pitchtype'
role = get_execution_role()

kmeans = KMeans(role=role,
               train_instance_count=1,
               train_instance_type='ml.c4.xlarge',
               output_path=output_location,
               k=num_clusters)
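
For intuition about what `k` controls, here is a minimal NumPy sketch of the textbook Lloyd's iteration: assign each row to its nearest centroid, then move each centroid to the mean of its members. This is an illustration only; SageMaker's built-in algorithm is not necessarily implemented this way, and `assign` and `lloyd_kmeans` are hypothetical helper names.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for every row of X."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise with k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last set of centroids
    return centroids, assign(X, centroids)

# Two well-separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = lloyd_kmeans(X, k=2)
print(labels)
```

The numeric cluster IDs depend on the random initialisation, so only the grouping is meaningful, not which group is called 0 or 1. The same holds for the cluster numbers produced later in this notebook.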

In [171]:
%%time
job_name = 'pitch-test-kmeans-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
# K-means is unsupervised, so no labels are passed to record_set
record_set = kmeans.record_set(data_train)
kmeans.fit(record_set, job_name=job_name)
print('Here is the job name {}'.format(job_name))

2020-04-27 01:12:10 Starting - Starting the training job...
2020-04-27 01:12:11 Starting - Launching requested ML instances......
2020-04-27 01:13:21 Starting - Preparing the instances for training...
2020-04-27 01:14:10 Downloading - Downloading input data......
2020-04-27 01:15:10 Training - Training image download completed. Training in progress.
2020-04-27 01:15:10 Uploading - Uploading generated training model.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/27/2020 01:15:07 INFO 140578399385408] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_enable_profiler': u'false', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u


2020-04-27 01:15:16 Completed - Training job completed
Training seconds: 66
Billable seconds: 66
Here is the job name pitch-test-kmeans-20200427011209
CPU times: user 522 ms, sys: 24.7 ms, total: 547 ms
Wall time: 3min 42s


# Deploy the Model and Make Predictions
From https://www.bmc.com/blogs/amazon-sagemaker/

In [172]:
%%time
kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-----------------!
CPU times: user 298 ms, sys: 9.17 ms, total: 307 ms
Wall time: 8min 32s


In [187]:
result = kmeans_predictor.predict(data_train)
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
clusters[:10]

[2.0, 5.0, 2.0, 9.0, 0.0, 0.0, 5.0, 0.0, 1.0, 9.0]
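
A quick way to see how evenly pitchers are spread across clusters is to count the labels. The sketch below uses only the ten labels printed above; in the notebook the full `clusters` list would take the place of the hypothetical `clusters_sample`.

```python
import pandas as pd

# First ten labels from the output above; use the full `clusters` list in practice
clusters_sample = [2.0, 5.0, 2.0, 9.0, 0.0, 0.0, 5.0, 0.0, 1.0, 9.0]

# Cast the float labels to int, then count members per cluster
cluster_sizes = pd.Series(clusters_sample).astype(int).value_counts().sort_index()
print(cluster_sizes)
```

Very small or empty clusters in the full result might suggest that `num_clusters` is too large for 830 pitchers.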

# Output the Pitchers and their Cluster Membership

In [184]:
import s3fs

df_pitcher_membership = df[['pitcher']].copy()
df_pitcher_membership['cluster']=clusters

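def _write_out_df(df, key):
    """Write df as CSV to s3://<bucket>/<key>."""
    bytes_to_write = df.to_csv(None).encode()
    fs = s3fs.S3FileSystem()
    # `bucket` is the bucket name defined when the data was loaded
    with fs.open('s3://{bucket}/{key}'.format(
            bucket=bucket,
            key=key
        ), 'wb') as f:
        f.write(bytes_to_write)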
        
_write_out_df(df_pitcher_membership, f'pitchtype/{job_name}/pitcher_cluster_membership.csv')
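
Note that `to_csv` writes the DataFrame index by default, so the file gets an extra unnamed first column when it is read back. The sketch below shows the difference locally, without touching S3. The two rows reuse pitcher IDs and labels from the outputs above, and `membership` is an illustrative name.

```python
import io
import pandas as pd

membership = pd.DataFrame({'pitcher': [282332, 407845], 'cluster': [2, 5]})

with_index = membership.to_csv(None)                  # index written as first column
without_index = membership.to_csv(None, index=False)  # only the data columns

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```

If downstream consumers only need the pitcher and cluster columns, passing `index=False` inside `_write_out_df` would keep the output file cleaner.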

# Viewing the results
Next, inspect the trained model in more detail, such as the cluster centroids.

[See the documentation of deserialization here](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-deserialization)

In [185]:
import os
model_key = f'{prefix}/{job_name}/output/model.tar.gz'

boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')

2304
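
On Unix, `os.system` returns the raw wait status of the last command, with the exit code in the high byte. Assuming the notebook ran on Linux, 2304 would correspond to exit code 9, which suggests the `unzip` step did not succeed; the next cell loads `model_algo-1` directly anyway. As a sketch, the archive could also be extracted with the standard library so that failures raise exceptions instead of returning a status code. `extract_model_archive` is a hypothetical helper name.

```python
import tarfile

def extract_model_archive(archive_path, dest_dir='.'):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        # Only extract archives from a trusted source, such as your own training job
        tar.extractall(dest_dir)
    return names

# Decode the status shown above: the exit code is stored in the high byte
exit_code = 2304 >> 8
print(exit_code)
```

In the notebook this would be called as `extract_model_archive('model.tar.gz')` after the download.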

In [186]:
kmeans_model_params = mx.ndarray.load('model_algo-1')
cluster_centroids_kmeans = pd.DataFrame(kmeans_model_params[0].asnumpy())
cluster_centroids_kmeans.columns = df_data_train.columns[1:]
cluster_centroids_kmeans

Unnamed: 0,p_throws,2f_ax,2f_ay,2f_az,2f_pct_usage,2f_pfx_x,2f_pfx_z,2f_plate_x,2f_plate_z,2f_release_extension,...,sl_release_extension,sl_release_pos_x,sl_release_pos_z,sl_release_speed,sl_release_spin_rate,sl_sz_bot,sl_sz_top,sl_vx0,sl_vy0,sl_vz0
0,0.964467,-0.178254,-0.07371738,0.061187,0.085152,-0.185108,0.066459,-0.1453121,-0.053865,0.05025877,...,0.08605,-0.34851,0.136777,-0.06207552,0.005796,-0.044644,-0.003179,0.337102,0.05798,-0.036798
1,0.05000001,1.309089,-0.5980115,0.201254,0.14232,1.343369,0.372146,0.928489,-0.037955,-0.5080095,...,-0.293915,0.442917,0.224854,-0.3693717,-0.006405,-0.079575,-0.106498,-0.398748,0.369415,0.131925
2,1.192093e-07,0.687806,-0.08451711,-0.097177,0.097085,0.686454,-0.073906,0.4290763,-0.017627,-0.00383218,...,-0.06469,1.22647,-0.044546,-0.153466,-0.06184,0.005386,-0.091032,-1.212968,0.155808,0.074195
3,1.0,-0.033098,-0.07288931,-0.040243,0.069479,-0.038659,-0.020269,-0.005879611,0.08811,-0.02897505,...,-0.283708,-0.123385,-0.187217,-0.02977845,-0.097162,-0.112012,-0.124603,0.106271,0.030944,0.217015
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.843966,-0.165261,0.009994,-1.540697,-0.887619,0.399806,0.709734,0.064469,1.550804,0.859778
5,1.0,-0.374134,0.49105,0.150886,0.140474,-0.349091,0.042992,-0.2905415,0.178951,-0.01474119,...,-0.024019,-0.422501,0.212902,0.6322104,0.089393,0.082233,0.034399,0.477446,-0.627175,-0.459397
6,1.0,-0.249759,-3.013914,-0.769798,0.690561,-0.519815,0.416976,0.2076755,-0.158409,-1.289145,...,0.0,0.0,0.0,1.862645e-09,0.0,0.0,0.0,0.0,0.0,0.0
7,1.0,-0.360175,-0.1218212,-0.428576,0.135838,-0.368228,-0.391597,-0.02659051,-0.040455,0.08885404,...,0.134926,-0.633305,-1.005179,-0.3246852,0.311132,-0.020933,-0.043656,0.516604,0.324208,0.818374
8,0.8571429,0.0,3.72529e-09,0.0,0.0,0.0,0.0,-7.450581e-09,0.0,-3.72529e-09,...,0.562721,-0.079125,-3.566267,-0.9160632,-0.328421,-0.318776,-0.385375,-0.084813,0.928679,2.549365
9,1.0,-0.018316,-0.1360602,-0.185542,0.00381,-0.03102,-0.160854,-0.1354281,0.146216,-0.2722305,...,-0.343919,-0.083105,0.186246,-0.6017559,-0.503449,0.193464,0.442727,0.074686,0.603449,0.090358
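
With 217 columns the centroid table is hard to read at a glance. One possible way to summarise it is to list, for each cluster, the features whose centroid values are furthest from 0. The sketch below uses a toy frame with a few rounded values from clusters 1 and 2 above; `top_features` is a hypothetical helper, and in the notebook it would be applied to `cluster_centroids_kmeans`.

```python
import pandas as pd

def top_features(centroids, n=3):
    """For each cluster (row), list the n features with the largest absolute centroid value."""
    return {
        cluster: row.abs().nlargest(n).index.tolist()
        for cluster, row in centroids.iterrows()
    }

# Toy centroids built from a few values in the table above (clusters 1 and 2)
toy_centroids = pd.DataFrame(
    {
        'p_throws': [0.05, 0.0],
        '2f_ax': [1.309089, 0.687806],
        'sl_release_pos_x': [0.442917, 1.22647],
    },
    index=[1, 2],
)
summary = top_features(toy_centroids, n=2)
print(summary)
```

Because `p_throws` is a 0/1 indicator rather than a standardized measurement, it may be worth excluding it from this ranking and reading it separately as the share of right-handed pitchers in each cluster.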
