# Data Mining: Cluster Analysis

Data mining is the exploration and analysis of data to find patterns and discover otherwise unknown relationships within it. It usually involves a combination of human judgment and any number of algorithms, many of which are applied repeatedly across different types of data.

Many of these common algorithms are also used for machine learning, but data mining and machine learning are ***not*** the same thing. The difference is that data mining analyzes the data without making future predictions, while machine learning analyzes the data and then makes a prediction about what will happen in the future.

In this chapter, we will explore cluster analysis and implement the K-Means algorithm. Cluster analysis is a type of problem that applies to both data mining and machine learning, and K-Means is an algorithm used in both. We will restrict our work to data mining.
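
To make the idea concrete before calling the course's `kmeans` module, here is a minimal sketch of the two steps K-Means repeats: assign each point to its nearest centroid, then move each centroid to the mean of its members. The function name `kmeans_sketch`, the fixed starting centroids, and the tiny `sample` data are assumptions made for illustration; this is not the implementation inside `kmeans`, which is not shown in this chapter.

```python
import math

def kmeans_sketch(points, centroids, max_iterations=100):
    """Minimal K-Means: `points` maps a label to a coordinate tuple,
    `centroids` is a list of starting coordinate tuples (hypothetical API)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for key, point in points.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(key)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                coords = [points[key] for key in cluster]
                new_centroids.append(
                    tuple(sum(axis) / len(coords) for axis in zip(*coords)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up sample data with two obvious groups.
sample = {"a": (1, 1), "b": (2, 1), "c": (9, 9), "d": (10, 8)}
final_centroids, final_clusters = kmeans_sketch(sample, [(0, 0), (10, 10)])
```

In practice the starting centroids are usually chosen at random, so different runs can produce different clusters; fixed starting points are used here only to keep the sketch deterministic.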

In [2]:
import kmeans

student_grades = {"Alice":(93, 88), "Bob":(55, 55), "Charles":(90, 87), "Dave":(63,57), "Ellen":(89,88), 
                  "Frita":(96,94), "Grant":(70,86), "Heidi":(98,96), "Isabelle":(86,94), "Jack":(88,94), 
                  "Kate":(60,86), "Lisa":(85,86), "Mary":(90,95), "Nancy":(63,58), "Orville":(88,61),
                  "Peter":(95,58), "Quinton":(83,89), "Ralph":(57,65), "Sally":(67,65), "Trent":(62,62),
                  "Ursala":(65,53), "Violet":(82,90), "Wally":(91,93), "Xavier":(81, 84), "Yolanda":(90, 63),
                  "Zack":(85,56)}

colors = ["royalblue", "forestgreen", "maroon"]
centroids, clusters = kmeans.CreateKMeansClusters1(3, student_grades, colors, "Homework Scores", "Final Exam Scores")

## Revisit K-Means to Add Outlier Removal

In [None]:
import statistics 

# Using interquartile range (IQR) for outlier detection
# https://medium.com/analytics-vidhya/effect-of-outliers-on-k-means-algorithm-using-python-7ba85821ea23
# We don't care about the lower threshold because that would mean that the datapoint is 
# exceptionally close to the centroid (good)... we only care about the upper threshold

def GetIQROutlierThreshold(data, cluster_list, centroid_list):

    # We need a sorted list of all of the Euclidean distances across every cluster
    distances = []
    for cluster, centroid in zip(cluster_list, centroid_list):
        # Look points up in the `data` parameter rather than the global student_grades
        distances += [round(kmeans.EuclidDist(data[key], centroid), 2) for key in cluster]
    distances = sorted(distances)
    
    outlier_threshold = 0
    
    # Split the sorted distances into a lower half and an upper half
    # Q1 (first quartile) is the median of the lower half
    # Q3 (third quartile) is the median of the upper half
    # The interquartile range (IQR) is Q3 - Q1
    # Return Q3 + (1.5 * IQR) as the threshold

    
    
    return outlier_threshold

outlier_threshold = GetIQROutlierThreshold(student_grades, clusters, centroids)
outlier_threshold
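
The comments in `GetIQROutlierThreshold` describe the calculation in words; the sketch below shows one way it could look on a plain list of distances. The helper name `iqr_upper_threshold` and the sample distances are hypothetical. It assumes the "median of halves" convention, which drops the middle value when the list has an odd length, and it needs at least two distances. Other quartile conventions exist and give slightly different thresholds.

```python
import statistics

def iqr_upper_threshold(distances):
    """Return Q3 + 1.5 * IQR for a list of distances (illustrative helper)."""
    ordered = sorted(distances)
    half = len(ordered) // 2

    # Lower and upper halves; the middle value is excluded for odd lengths.
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]

    q1 = statistics.median(lower_half)   # first quartile
    q3 = statistics.median(upper_half)   # third quartile
    iqr = q3 - q1                        # interquartile range
    return q3 + 1.5 * iqr

# Made-up distances: seven ordinary values and one far from its centroid.
sample_distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 30.0]
threshold = iqr_upper_threshold(sample_distances)
```

For the sample list, Q1 is 2.5 and Q3 is 6.5, so the IQR is 4.0 and the threshold is 12.5. Only the distance of 30.0 exceeds it.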

In [None]:
def RemoveOutliersFromCluster(data, cluster, centroid, threshold):

    keys_to_remove = []
    
    # Step through each point in the cluster and check if its Euclidean distance to the
    # centroid is greater than our threshold value... if so, mark the point for removal
    
    # TODO: MARK POINTS FOR REMOVAL THAT ARE BEYOND THE OUTLIER THRESHOLD
    

    # Remove each of the outliers
    for key in keys_to_remove:
        cluster.remove(key)
        
    return keys_to_remove


def RemoveOutliers(data, clusters, centroids, threshold):
    
    outliers = []
    for cluster, centroid in zip(clusters, centroids):
        outliers += RemoveOutliersFromCluster(data, cluster, centroid, threshold)
        
    clusters.append(outliers)
    return outliers
        
RemoveOutliers(student_grades, clusters, centroids, outlier_threshold)
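
As a rough illustration of the marking step left as a TODO in `RemoveOutliersFromCluster`, the sketch below collects the keys whose points lie farther than the threshold from the centroid and then removes them from the cluster. The function name, the sample points, and the use of `math.dist` in place of `kmeans.EuclidDist` are assumptions made so the example runs on its own.

```python
import math

def remove_outliers_from_cluster(data, cluster, centroid, threshold):
    """Remove keys from `cluster` (in place) whose points lie farther than
    `threshold` from `centroid`; return the removed keys."""
    # Mark first, remove afterwards, so the list is not modified while iterating.
    keys_to_remove = [key for key in cluster
                      if math.dist(data[key], centroid) > threshold]
    for key in keys_to_remove:
        cluster.remove(key)
    return keys_to_remove

# Made-up points: "r" sits far from the centroid at (0.5, 0.5).
points = {"p": (0, 0), "q": (1, 1), "r": (10, 10)}
cluster = ["p", "q", "r"]
removed = remove_outliers_from_cluster(points, cluster, (0.5, 0.5), 5.0)
```

Collecting the keys before removing them matters: deleting items from a list while looping over it can silently skip elements.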

In [None]:
import matplotlib.pyplot as plt

def PlotClusters2Dv2(data, cluster_list, centroid_list, color_list, x_label=None, y_label=None, inches=5):
      
    plt.figure(figsize=(inches, inches))

    # Build the point colors on a copy so the caller's color_list is not mutated;
    # an extra cluster (the removed outliers) is drawn in silver
    point_colors = list(color_list)
    if len(cluster_list) > len(centroid_list):
        point_colors.append("silver")

    data_x = [data[item][0] for cluster in cluster_list for item in cluster]
    data_y = [data[item][1] for cluster in cluster_list for item in cluster]
    data_c = [point_colors[i] for i, cluster in enumerate(cluster_list) for item in cluster]
    plt.scatter(data_x, data_y, c=data_c)

    centroids_x = [centroid[0] for centroid in centroid_list]
    centroids_y = [centroid[1] for centroid in centroid_list]
    centroids_c = color_list[:len(centroid_list)]
    # A scalar marker size works for any number of centroids
    plt.scatter(centroids_x, centroids_y, c=centroids_c, s=1000, alpha=0.3)

    plt.xlabel(x_label, fontsize=16, labelpad=15)
    plt.ylabel(y_label, fontsize=16, labelpad=15)

    plt.show()
    return

In [None]:
k = 3
colors = ["royalblue", "forestgreen", "maroon"]
PlotClusters2Dv2(student_grades, clusters, centroids, colors, "Homework Scores", "Final Exam Scores")