# SI 618 - Homework 7 - Clustering Bob Ross paintings using k-means

## Objectives
* Practice k-means clustering
* Gain experience moving between pandas and Spark (both ways)

## Submission Instructions:
Please turn in this Databricks notebook in HTML format as well as the URL to the published version of this notebook via Canvas.

## Background
### Bob Ross
For this particular exercise we're going to use k-means clustering to analyse Bob Ross paintings.  I was inspired by the FiveThirtyEight article on his work, which you should read for background:

https://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/

For those of you unfamiliar with Bob ("Happy Trees") Ross... check out his videos on YouTube (e.g. https://www.youtube.com/watch?v=kJFB6rH3z2A) and the Wikipedia page on him.  We're going to use the data provided by fivethirtyeight but will also augment it with the actual images. We've downloaded a few hundred thumbnails and will use those as well.

**NOTE:** Do not attempt to just run this entire notebook.  Read over each step before you run it and try to understand what's going on.

### CODE FOR DATA PREPARATION AND THE FIRST K-MEANS ANALYSIS IS DONE FOR YOU!
You should study and attempt to understand the code before you move on to 
completing parts 2-6 (and, optionally, one of the "Above and Beyond" tasks).

### As a first step, let's load the libraries we need:

In [5]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
import seaborn as sns
import matplotlib.cm as cm
import re
import os.path

### Getting the data from Amazon Web Services S3

We need the CSV file with data (bob_ross.csv), as well as a collection of images to complete this assignment.
One way for us to share those with you is to put them in an AWS S3 bucket and get you to "mount" that bucket
as a directory that's accessible via this notebook.

The following code block does exactly that, making the bucket containing those files available to this notebook.  To Spark, it will look like the files live in a directory called ```/mnt/umsi-data-science```.  
To pandas, which we will use to read the data, the files will live in ```/dbfs/mnt/umsi-data-science```.  Note the use of ```/dbfs``` as a prefix in the pandas version.
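As a small illustration of that prefix rule, here is a sketch of a helper that converts a Spark-style path into the path pandas would use. The function name `to_local_path` is hypothetical and not part of Databricks; the sample path is the CSV location used later in this notebook.

```python
def to_local_path(dbfs_path):
    """Map a DBFS path such as /mnt/... to the local-file path used by pandas."""
    # Drop an optional "dbfs:" scheme so both spellings are accepted
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    if not dbfs_path.startswith("/"):
        dbfs_path = "/" + dbfs_path
    return "/dbfs" + dbfs_path

spark_path = "/mnt/umsi-data-science/si618wn2017/bob_ross.csv"
print(to_local_path(spark_path))
```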

You should be able to just run the next code block.  At the end of the code block is a command to list the contents of the 
mounted S3 bucket.

In [7]:
ACCESS_KEY = 
SECRET_KEY = 
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "umsi-data-science-west"
MOUNT_NAME = "umsi-data-science"
try:
  dbutils.fs.unmount("/mnt/%s/" % MOUNT_NAME)
except Exception:
  print("Could not unmount %s, but that's ok." % MOUNT_NAME)
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/umsi-data-science/si618wn2017"))

### Defining a helper function to simplify the color space of images
The next code block sets up a utility function (```getColors```), which takes an image and figures out which colors are used.
It reduces the color space to at most 85 color bins, indexed 0 to 84 (from the roughly 16.7 million colors of 24-bit RGB), and returns the normalized count of 
each bin's appearance in the image.

The function takes as input the filename of an image file.  It opens the file and sets up a numpy array of zeros, one entry for each of the
85 output bins.  The function then goes through all of the pixels in the image and calculates the red, green and blue 
color values in the reduced space (that's why we divide each of the values for red, green and blue by 63 and keep the integer part).  We then put
the red, green and blue values back together again by bit-shifting the green and blue values and then using a bitwise 'or'.
Let's say we had a color of 126,189,252 (which is a pleasant blue color).  Dividing those values by 63, we get 2,3,4.
Bit-shifting 3 << 2 gives us 12, 4 << 4 gives us 64.  We don't bit-shift the red values, so we just keep the 2.  Adding those
together (equivalent here to a bitwise "or", because these three values share no bits) gives us 78, so we would increment the count of color 78.

Finally, we convert all counts to proportions and return the proportions of each color as a numpy array.
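The arithmetic above can be checked with a short sketch. The helper name `color_bucket` is hypothetical; it repeats the per-pixel computation described in the text for a single 8-bit RGB triple.

```python
def color_bucket(red, green, blue):
    """Return the reduced-color index for one 8-bit RGB pixel."""
    r = red // 63              # 0..4, not shifted
    g = (green // 63) << 2     # shifted left by 2 bits
    b = (blue // 63) << 4      # shifted left by 4 bits
    return r | g | b           # bitwise "or" of the three parts

print(color_bucket(126, 189, 252))   # the worked example from the text: 78
print(color_bucket(255, 255, 255))   # the largest possible index: 84

# The shifted fields can overlap, so different colors may share a bin:
print(color_bucket(252, 0, 0), color_bucket(0, 63, 0))   # both are 4
```

Because a reduced value of 4 needs three bits while the green field starts only two bits up, some distinct colors land in the same bin, which is one reason the text says "about" 85 colors.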

In [9]:
def getColors(img):
    im = Image.open(img, 'r')
    width, height = im.size
    #print(img,width,height)
    pixel_values = list(im.getdata())
    cnt = np.zeros(85,dtype=int)
    for i in pixel_values:
        #print(i)
        r = int(i[0]/63)
        g = int(i[1]/63)<<2
        b = int(i[2]/63)<<4
        x = r | g | b
        #print(x)
        cnt[x] = cnt[x] + 1
        #print(cnt[x])
    cnt = cnt/float(sum(cnt))
    return(cnt)

### Loading the "tags" file into a pandas DataFrame
First, we're going to load the CSV file of the human-assigned tags for each of Bob's paintings into a **pandas** DataFrame.  Remember that we mounted the AWS S3 bucket containing the data as ```/mnt/umsi-data-science/si618wn2017``` and the CSV file is named ```bob_ross.csv```.  We can read the file using the (hopefully)
familiar ```pd.read_csv()``` function in pandas, using the first column (the episode ID) as the index:

In [11]:
bob_ross = pd.read_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv", index_col=0)

Let's take a look at the contents:

In [13]:
print(bob_ross.shape)
bob_ross.head()

The above cell prints the shape of the DataFrame and shows its first 5 rows.  Each row has 68 columns: the painting's title plus 67 "tags" for each of the images that
we will load.  The tags were generated by people, and indicate the presence or absence of various features (e.g. "BEACH"); each tag is set to 1 if the 
feature is present or 0 if the feature is not present.

## NOTE: The next code block takes a very long time (about 5 minutes) to complete.  Wait for it!

In [16]:
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s'%i for i in np.arange(0,85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')

# figure out if we have the image or not, we don't have a complete set
image_dir = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"
for s in bob_ross.index.values:
    # build the file name from the title: lowercase, drop punctuation, whitespace -> underscores
    b = bob_ross.loc[s, 'TITLE'].lower()
    b = re.sub(r'[^a-z0-9\s]', '', b)
    b = re.sub(r'\s', '_', b)
    img = image_dir + b + ".png"
    if os.path.exists(img):
        # .at replaces DataFrame.set_value, which newer pandas versions no longer provide
        bob_ross.at[s, "image"] = img
        colors.loc[s] = getColors(img)
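
To make the title-to-filename step in the loop above concrete, here is a standalone sketch of the same two substitutions. The function name `title_to_filename` is hypothetical, and the sample title is only an illustration of the quoted, upper-case style a title might have.

```python
import re

def title_to_filename(title):
    """Lowercase, strip everything except letters, digits and whitespace, then underscore."""
    name = title.lower()
    name = re.sub(r'[^a-z0-9\s]', '', name)   # drops quotes and punctuation
    name = re.sub(r'\s', '_', name)           # each whitespace character becomes "_"
    return name + ".png"

print(title_to_filename('"A WALK IN THE WOODS"'))   # a_walk_in_the_woods.png
```

Note that each whitespace character is replaced individually, so a title containing a double space would produce a double underscore.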


In [17]:
# join the colors and tag database and toss the rows where we don't have an image
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""] 

In [18]:
# these are masks you might find handy to only get the colors, the tags, or both (as well as the image path)
color_columns = ['c%d' % i for i in range(85)]  # 'c0' ... 'c84'
tag_columns = ['APPLE_FRAME', 'AURORA_BOREALIS', 'BARN', 'BEACH', 'BOAT',
       'BRIDGE', 'BUILDING', 'BUSHES', 'CABIN', 'CACTUS', 'CIRCLE_FRAME',
       'CIRRUS', 'CLIFF', 'CLOUDS', 'CONIFER', 'CUMULUS', 'DECIDUOUS',
       'DIANE_ANDRE', 'DOCK', 'DOUBLE_OVAL_FRAME', 'FARM', 'FENCE', 'FIRE',
       'FLORIDA_FRAME', 'FLOWERS', 'FOG', 'FRAMED', 'GRASS', 'GUEST',
       'HALF_CIRCLE_FRAME', 'HALF_OVAL_FRAME', 'HILLS', 'LAKE', 'LAKES',
       'LIGHTHOUSE', 'MILL', 'MOON', 'MOUNTAIN', 'MOUNTAINS', 'NIGHT', 'OCEAN',
       'OVAL_FRAME', 'PALM_TREES', 'PATH', 'PERSON', 'PORTRAIT',
       'RECTANGLE_3D_FRAME', 'RECTANGULAR_FRAME', 'RIVER', 'ROCKS',
       'SEASHELL_FRAME', 'SNOW', 'SNOWY_MOUNTAIN', 'SPLIT_FRAME', 'STEVE_ROSS',
       'STRUCTURE', 'SUN', 'TOMB_FRAME', 'TREE', 'TREES', 'TRIPLE_FRAME',
       'WATERFALL', 'WAVES', 'WINDMILL', 'WINDOW_FRAME', 'WINTER',
       'WOOD_FRAMED']
all_columns = color_columns + tag_columns + ['image']
color_columns = color_columns + ['image']
tag_columns = tag_columns + ['image']

In [19]:
# this is a utility function for displaying a grid of images, with an optional heading
def display_images(imagelist,cluster_title=None):
    # keep only the file name (e.g. "some_title.png") from each path
    a = imagelist.apply(lambda x: re.search(r'(\w+\.png)', x).group(1))
    # pad with empty strings so the list fills complete rows of 7
    a = np.append(a, np.zeros((-len(a)) % 7, dtype=str))
    grid = a.reshape(len(a)//7, 7)
    text = ""
    if cluster_title is not None:
       text = "<h1>"+cluster_title+"</h1>\n" 
    text = text + "<table>"
    for row in grid:
        line = ''.join( ["\n<TD><img style='width: 120px; margin: 0px; float: left; border: 1px solid black;' src='https://s3.amazonaws.com/si618image/images/%s' /></TD>" % str(s) for s in row])
        text = text + "<TR>"+line+"</TR>\n"
    text = text +"</table>"
    displayHTML(text)
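
The grid layout in `display_images` depends on padding the list of file names to a multiple of 7. A minimal pure-Python sketch of that idea follows; the helper `to_rows` and the file names are made up for illustration.

```python
def to_rows(items, per_row=7, filler=""):
    """Pad items to a multiple of per_row, then split them into rows."""
    items = list(items)
    padding = (-len(items)) % per_row   # 0 when the list already fills its last row
    items += [filler] * padding
    return [items[i:i + per_row] for i in range(0, len(items), per_row)]

rows = to_rows(["img%d.png" % i for i in range(12)])
print(len(rows), [len(r) for r in rows])   # 2 [7, 7]
```

Using `(-len(items)) % per_row` rather than `per_row - len(items) % per_row` avoids adding a whole empty row when the number of images is already a multiple of 7.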

In [20]:
# for example, we can display the first 12 images
display_images(bob_ross.image[0:12],"sample images")

## K-means
### 1) K-Means on tags (2 clusters)

We're going to start by replicating the fivethirtyeight article a bit.  Using *only* the tags, perform a k-means clustering with 2 clusters. Use display_images to show the images from each cluster.

**We are going to move our data into Spark for this analysis.**

In [22]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

tags = spark.createDataFrame(bob_ross[tag_columns[:-1]])

assembler = VectorAssembler(
    inputCols=tag_columns[:-1],
    outputCol="features")

tags_assembled = assembler.transform(tags)
tags_assembled.select("features").show(5, truncate=False)

In [23]:
# Create k-means model, k=2, and fit the data
kmeans = KMeans().setK(2).setSeed(1)
kmeans_model = kmeans.fit(tags_assembled)

# Make predictions
tags_predictions = kmeans_model.transform(tags_assembled)

In [24]:
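# Evaluate clustering by computing WSSSE.
# (computeCost was deprecated in Spark 2.4 and removed in 3.0; on newer runtimes,
# kmeans_model.summary.trainingCost reports the cost on the training data.)
wssse = kmeans_model.computeCost(tags_assembled)
print("Within Set Sum of Squared Errors = " + str(wssse))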

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(tags_predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
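
To show what the two numbers above measure, here is a naive pure-Python sketch of WSSSE and the mean silhouette score on four made-up points. It uses squared Euclidean distance, as the printed label suggests Spark does by default, but it is not Spark's implementation and its values need not match Spark's exactly. All names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def mean_silhouette(points, labels):
    """Naive mean silhouette; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # collect distances from point i to every other point, grouped by cluster
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(sq_dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:            # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]       # made-up data
labels = [0, 0, 1, 1]
centers = {0: (0, 0.5), 1: (5, 5.5)}

print(wssse(points, labels, centers))    # 1.0
print(mean_silhouette(points, labels))   # close to 1: tight, well-separated clusters
```

Lower WSSSE means tighter clusters, but it always falls as k grows; the silhouette score also rewards separation, so it can peak at a specific k.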

In [25]:
# Show the clustering centers
centers = kmeans_model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

In [26]:
# Show the clustering membership
tags_transformed = kmeans_model.transform(tags_assembled)
tags_transformed[['prediction']].show(5)

### Now move back into pandas...
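A frame returned by Spark's `toPandas()` comes back with a fresh 0..n-1 index, while `bob_ross` is indexed by episode, so the predictions have to be attached by position rather than by label. The pandas-only sketch below uses made-up data (`paintings` and `preds` are stand-ins) and assumes the row order survived the round trip through Spark, which is not strictly guaranteed.

```python
import pandas as pd

paintings = pd.DataFrame(
    {"TITLE": ["a", "b", "c"]},
    index=pd.Index(["S01E01", "S01E02", "S01E03"], name="EPISODE"),
)
# Stand-in for tags_predictions.select("prediction").toPandas()
preds = pd.DataFrame({"prediction": [1, 0, 1]})

# Assigning the Series directly aligns on index labels, which do not match here
misaligned = paintings.assign(prediction=preds["prediction"])
print(misaligned["prediction"].isna().all())   # True: every value is missing

# Assigning the raw values attaches them by position instead
paintings["prediction"] = preds["prediction"].to_numpy()
print(paintings)
```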

In [28]:
bob_ross["prediction"] = tags_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")

In [29]:
df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image'],"Cluster 2 Images")

### 2) Describe the differences

Cluster 1 images have both mountains, typically snow-capped or white, and trees, typically in the foreground. In contrast, Cluster 2 images tend to have trees and cabins, but not mountains. The types of trees are also different: in Cluster 1, the trees are "evergreen," meaning they do not lose their leaves. In Cluster 2, the trees are deciduous -- meaning they lose their leaves. There are many images in Cluster 2 where the trees are leafless, or have leaves that are changing colors (i.e., yellow or red or brown).

### 3) Calculate the differences between clusters

One thing we can do to compare the clusters is to determine which tags show up more in the first cluster and which ones appear more in the second. Write code to determine which tags are maximally different between the two clusters.  You should get output that looks like:
```
MOUNTAIN              0.967647
SNOWY_MOUNTAIN        0.681513
MOUNTAINS             0.638655
CONIFER               0.515126
LAKE                  0.294958
```

Hint: you can do this with some combination of masks, .mean() and .sort_values() all in one line (but feel free to write a loop if it's easier to think about)
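
As a sketch of the one-line approach the hint describes, the following uses a tiny made-up table rather than the real data. The frame `toy` and its values are hypothetical; only the pattern (mask, `.mean()`, `.sort_values()`) is the point.

```python
import pandas as pd

toy = pd.DataFrame({
    "MOUNTAIN":   [1, 1, 0, 0],
    "CABIN":      [0, 1, 1, 1],
    "TREE":       [1, 1, 1, 1],
    "prediction": [0, 0, 1, 1],
})
tags = ["MOUNTAIN", "CABIN", "TREE"]
is_first = toy["prediction"] == 0   # mask selecting the first cluster

# mean tag frequency per cluster, absolute difference, largest first
diff = (toy.loc[is_first, tags].mean()
        - toy.loc[~is_first, tags].mean()).abs().sort_values(ascending=False)
print(diff)
```

The result is a Series indexed by tag name, which is the same shape as the sample output shown above.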

In [32]:
# tag_columns (defined with the other column masks above) ends with 'image', so drop that last entry
tag_cols = tag_columns[:-1]
# Now loop through tags
tag_frequencies = []
for column in tag_cols:
  diff = abs(df_0[column].mean() - df_1[column].mean())
  tag_frequencies.append(diff)

table = pd.DataFrame({'Tag':tag_cols, 'Difference':tag_frequencies})
order = ['Tag','Difference']
table = table[order].sort_values('Difference', ascending=False)

In [33]:
table.head()
# note: I kept getting an error when trying to set Tag as the index, to more closely replicate the example result.

### 4) Find a better value of k

In [35]:
# Method 1: Use your eyeballs!
# display_images(bob_ross.image,"All Images")

**Use display_images to show the different clusters, pick the best value of k, and describe your clusters qualitatively.**

In [37]:
# Method 2: "Rule of Thumb"
# do you round to 14?
guess = np.sqrt(tags_predictions.count()/2)
guess

In [38]:
# method 3: Scree plot
cost = list()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(tags_assembled)
    cost.append(kmeans_model.computeCost(tags_assembled))
print(cost)

In [39]:
fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Error');
plt.title('Elbow for K-Means clustering');

display(fig)

The elbow method suggests 6 clusters.
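
Reading an elbow by eye is subjective. One rough heuristic, offered here only as a sketch, is to pick the k where the drop in cost slows down the most (the largest second difference). The function name and the cost values below are made up and are not the costs from this notebook.

```python
def elbow_by_second_difference(ks, costs):
    """Return the k at which the cost curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(costs) - 1):
        drop_before = costs[i - 1] - costs[i]
        drop_after = costs[i] - costs[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [2, 3, 4, 5, 6, 7]
costs = [900.0, 700.0, 550.0, 520.0, 500.0, 485.0]   # made-up WSSSE values
print(elbow_by_second_difference(ks, costs))          # 4
```

On noisy curves this heuristic can disagree with a visual reading, so it is best treated as a second opinion rather than a rule.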

In [41]:
# method 4: Silhouette scores
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(tags_assembled)
    tags_predictions = kmeans_model.transform(tags_assembled)
    silhouette = evaluator.evaluate(tags_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.plot(range(2,20)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for k-means clustering')

display(fig)

The silhouette score suggests we have three clusters. Use this in the code below:

In [43]:
# Now re-run analysis with k = 3
kmeans = KMeans().setK(3).setSeed(1)
kmeans_model = kmeans.fit(tags_assembled)
tags_predictions = kmeans_model.transform(tags_assembled)
#tags_transformed = kmeans_model.transform(tags_assembled)
bob_ross["prediction"] = tags_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")


In [44]:
df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image']," Cluster 2 Images")

In [45]:
df_2 = bob_ross[bob_ross["prediction"] == 2]
display_images(df_2['image']," Cluster 3 Images")

When k=3, we can easily see the distinct clusters. 
1. Typically pictures of oceans and/or waves, with pinkish sunsets.
2. Scenes of trees with no mountains; many have cabins.
3. Scenes of mountains and trees, typically in dark blue, grey, and green colors.

### 5) k-means based on colors
Perform k-means clustering on the paintings using *only* the color columns. Decide a good value for k, execute the clustering, display the images in each cluster, and describe the resulting clusters.

In [48]:
colors = spark.createDataFrame(bob_ross[color_columns[:-1]])

assembler = VectorAssembler(
    inputCols=color_columns[:-1],
    outputCol="features")

colors_assembled = assembler.transform(colors)

In [49]:
# Create k-means model, k=2, and fit the data
kmeans = KMeans().setK(2).setSeed(1)
kmeans_model = kmeans.fit(colors_assembled)

# Make predictions
colors_predictions = kmeans_model.transform(colors_assembled)

# Evaluate clustering by computing WSSSE.
wssse = kmeans_model.computeCost(colors_assembled)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(colors_predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

What's an appropriate number of clusters?

In [51]:
# Scree plot
cost = list()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(colors_assembled)
    cost.append(kmeans_model.computeCost(colors_assembled))

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Error');
plt.title('Elbow for K-Means clustering');

display(fig)

The elbow method suggests 8 clusters.

In [53]:
# Check Silhouette scores
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(colors_assembled)
    colors_predictions = kmeans_model.transform(colors_assembled)
    silhouette = evaluator.evaluate(colors_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.plot(range(2,20)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for k-means clustering')

display(fig)

The Silhouette scores of using just colors suggest two clusters, which I'll do below:

In [55]:
# Run analysis with k = 2
kmeans = KMeans().setK(2).setSeed(1)
kmeans_model = kmeans.fit(colors_assembled)
colors_predictions = kmeans_model.transform(colors_assembled)
bob_ross["prediction"] = colors_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")

In [56]:
df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image']," Cluster 2 Images")

When k=2, we can easily see the distinct colors.
1. Lighter colors -- purple, pink, magenta, and yellow.
2. Darker colors -- green, blue, and brown.

### 6) Use both tags and colors for k-means clustering

Perform k-means clustering on the paintings using *both* tag and color columns. Decide a good value for k, execute the clustering, display the images in each cluster, and describe the resulting clusters.

In [59]:
colors_tags = spark.createDataFrame(bob_ross[all_columns[:-1]])

assembler = VectorAssembler(
    inputCols=all_columns[:-1],
    outputCol="features")

all_assembled = assembler.transform(colors_tags)

# Create k-means model, k=2, and fit the data
kmeans = KMeans().setK(2).setSeed(1)
kmeans_model = kmeans.fit(all_assembled)
all_predictions = kmeans_model.transform(all_assembled)

In [60]:
# Scree plot
cost = list()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(all_assembled)
    cost.append(kmeans_model.computeCost(all_assembled))

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Error');
plt.title('Elbow for K-Means clustering');

display(fig)

The elbow method suggests five clusters.

In [62]:
# Check Silhouette scores
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,20):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    kmeans_model = kmeans.fit(all_assembled)
    all_predictions = kmeans_model.transform(all_assembled)
    silhouette = evaluator.evaluate(all_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.plot(range(2,20)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for k-means clustering')

display(fig)

The elbow method suggests 5, but the silhouette score suggests 2. Let's try 5, below:

In [64]:
# Run analysis with k = 5
kmeans = KMeans().setK(5).setSeed(1)
kmeans_model = kmeans.fit(all_assembled)
all_predictions = kmeans_model.transform(all_assembled)
bob_ross["prediction"] = all_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")

In [65]:
df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image']," Cluster 2 Images")

In [66]:
df_2 = bob_ross[bob_ross["prediction"] == 2]
display_images(df_2['image']," Cluster 3 Images")

In [67]:
df_3 = bob_ross[bob_ross["prediction"] == 3]
display_images(df_3['image']," Cluster 4 Images")

In [68]:
df_4 = bob_ross[bob_ross["prediction"] == 4]
display_images(df_4['image']," Cluster 5 Images")

When k = 5, and we use both tags and colors, the clusters are:
1. Blue and white mountains, with green trees.
2. White mountains, many with cabins.
3. Green forests, many with a river.
4. Oceans and waves, often with a light sunset.
5. Trees with light green or orange leaves, and lighter sunsets.

## Above and Beyond
2. Repeat the analysis for Step 6 (both tags and colors) using bisecting k-means **and compare the results to k-means**. Describe, in detail, how the resulting clusters differ.  The majority of your work should go into exploring the differences in the results.
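
Cluster numbers are arbitrary labels, so cluster 0 from k-means need not correspond to cluster 0 from bisecting k-means. One way to compare two clusterings, sketched here with made-up labels, is a contingency table that counts how the members of each k-means cluster are distributed across the bisecting k-means clusters. The column names `kmeans` and `bisecting` are hypothetical.

```python
import pandas as pd

labels = pd.DataFrame({
    "kmeans":    [0, 0, 1, 1, 2, 2, 2],   # made-up cluster assignments
    "bisecting": [1, 1, 0, 0, 2, 2, 0],
})

# rows: k-means cluster, columns: bisecting k-means cluster, cells: painting counts
table = pd.crosstab(labels["kmeans"], labels["bisecting"])
print(table)
```

A row with a single non-zero cell means the two methods agree on that group of paintings, even if they number it differently; rows spread over several columns show where the methods split the paintings differently.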

In [71]:
from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans().setK(3).setSeed(1)
bkm_model = bkm.fit(all_assembled)
all_predictions = bkm_model.transform(all_assembled)


In [72]:
# Scree plot
cost = list()
for k in range(2,20):
    bkm = BisectingKMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    cost.append(bkm_model.computeCost(all_assembled))

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Error');
plt.title('Elbow for Bisecting K-Means clustering');

display(fig)

In [73]:
# Check Silhouette scores
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,20):
    bkm = BisectingKMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    all_predictions = bkm_model.transform(all_assembled)
    silhouette = evaluator.evaluate(all_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,20), cost, 'b*-')
plt.plot(range(2,20)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xticks(np.arange(1, 21, 1.0)) # this helps us see the number of clusters we want!
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for Bisecting k-means clustering')

display(fig)

In [74]:
# Run analysis with k = 5
bkm = BisectingKMeans().setK(5).setSeed(1)
bkm_model = bkm.fit(all_assembled)
all_predictions = bkm_model.transform(all_assembled)
bob_ross["prediction"] = all_predictions.select("prediction").toPandas().set_index(bob_ross.index)

In [75]:
df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")


In [76]:

df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image']," Cluster 2 Images")


In [77]:

df_2 = bob_ross[bob_ross["prediction"] == 2]
display_images(df_2['image']," Cluster 3 Images")


In [78]:

df_3 = bob_ross[bob_ross["prediction"] == 3]
display_images(df_3['image']," Cluster 4 Images")



In [79]:
df_4 = bob_ross[bob_ross["prediction"] == 4]
display_images(df_4['image']," Cluster 5 Images")