## Clustering

- No labels: clustering works on unlabeled data, so there is no target variable to learn from.


### About the Module


- Clustering is an unsupervised machine learning methodology for grouping and identifying similar objects, people, or observations.

    
    - The resulting cluster ids can be turned into a new feature (or predictor) for a machine learning model, or used as a target.


- Clustering is often used as a preprocessing or an exploratory step in the data science pipeline so that the cluster that each item is assigned to becomes a feature for a supervised model.


- In this module, you will be introduced to various clustering algorithms and learn why and when to use them. You will learn how to use clustering methods to identify similar groups in Python with Scikit-Learn, and how to apply these clusters further down the pipeline.
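
To make the "cluster id as a feature" idea concrete, here is a minimal sketch. The toy array `X` and the variable names are made up for illustration; only `KMeans` and its `fit_predict` method come from Scikit-Learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two numeric columns forming two obvious groups.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [8.1, 7.9], [7.8, 8.0],
])

# Fit K-Means and get one integer cluster id per row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Append the cluster id as an extra column for a downstream supervised model.
X_with_cluster = np.column_stack([X, cluster_ids])
```

Because the id is only a label, it is usually safer to treat the new column as categorical (for example, by one-hot encoding it) than as a numeric quantity.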



### Use Cases

- Text: Document classification, summarization, topic modeling, recommendations


- Geographic: crime zones, housing prices


- Marketing: Customer segmentation, market research


- Anomaly detection: account takeover, security risk, fraud


- Image processing: radiology, security


### Vocabulary


- Euclidean Distance


- Manhattan Distance


- Cosine Similarity


- Sparse vs. Dense Matrix


- Manhattan (Taxicab) vs Euclidean Distance
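
The three distance and similarity measures above can be computed directly with NumPy. This is a small sketch with two made-up vectors `a` and `b`; it is meant to show the definitions, not a recommended library routine.

```python
import numpy as np

# Two hypothetical observations with three features each.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: straight-line distance between the points.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan (taxicab) distance: sum of absolute differences along each axis.
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: cosine of the angle between the vectors (ignores magnitude).
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine_similarity)
```

Here the per-feature differences are 3, 4, and 0, so the Euclidean distance is 5 and the Manhattan distance is 7.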

### Data Types

- Input: continuous data, or ordered discrete data at a minimum.


- Output: Integer representing a cluster id.


    - The number itself doesn't mean anything except that observations sharing the same number are most similar to one another. The cluster ids also can't be compared with each other beyond the fact that they are different.

## Common Clustering Algorithms

### K-Means


- Description


    - One of the most popular clustering algorithms.


    - Stores k centroids that it uses to define clusters.


    - A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.
    
    
    - K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
    
    
    - Python implementation: sklearn.cluster.KMeans
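
The alternating procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy under simple assumptions (random observations as starting centroids, Euclidean distance), not how `sklearn.cluster.KMeans` is implemented; the function name `kmeans_sketch` and its parameters are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-Means loop; not the scikit-learn implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (1) Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (2) Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of points.
X_demo = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans_sketch(X_demo, k=2)
```

In practice, prefer `sklearn.cluster.KMeans`, which adds smarter initialization and multiple restarts on top of this basic loop.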

- Parameters


- Number of clusters (k; n_clusters in sklearn): The number of clusters to form, which is equal to the number of centroids to generate


- Number of initializations (n_init): The number of times the algorithm is run from scratch with different centroid seeds. The run with the lowest inertia (within-cluster sum of squared distances) is kept.


- Maximum number of iterations (max_iter): If the algorithm doesn't converge sooner, this is the maximum number of times a single run will loop through re-calculation of the centroids.


- random_state: Specific to sklearn, this is for 'setting the seed' for reproducibility. When you use any integer as a value here and then re-run with the same value on the same data, the algorithm will kick off with the same seed as before, and therefore with the same initial centroids and the same results.
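
As a rough illustration of how these parameters map onto the Scikit-Learn API, the sketch below fits `KMeans` on synthetic data. The data, the parameter values, and the variable names are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (made up for the example): two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(
    n_clusters=2,      # k: number of clusters / centroids
    n_init=10,         # number of runs with different centroid seeds
    max_iter=300,      # cap on iterations within a single run
    random_state=123,  # fixes the seed so results are reproducible
)
kmeans.fit(X)

labels = kmeans.labels_              # cluster id for each observation
centroids = kmeans.cluster_centers_  # one row per centroid
iterations_used = kmeans.n_iter_     # iterations taken by the best run
```

Re-running the same code with the same `random_state` and data should give identical labels and centroids.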

- Pros


1. Performance scales well with the amount of data, i.e. for a fixed number of clusters and features, each iteration of the algorithm is linear in the number of objects, $O(n)$

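2. Tends to create tight, compact clusters, because it minimizes the distance from each point to its cluster's centroid

3. Centroids are recomputed on every iteration, so an observation or object can move to another cluster as the solution improves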


- Cons


1. Naive use of the mean value for the cluster center, which makes centroids sensitive to outliers


2. Performs poorly when the clusters are not roughly circular (spherical in higher dimensions)


3. Hard to predict what k (the number of clusters) should be


4. Which observations the clustering starts with, i.e. initial seeds, can dramatically affect the results


5. The order of the data can affect the results


6. Results are extremely sensitive to the scale of the data.
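
Two of the cons above have common workarounds, sketched here under stated assumptions: scaling features before clustering (using `StandardScaler`, one of several possible scalers), and comparing inertia across several values of k as a rough guide (often called the elbow heuristic, which is not part of the original notes). The `age` and `income` features are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age in years, income in dollars.
rng = np.random.default_rng(1)
age = np.concatenate([rng.normal(25, 2, 50), rng.normal(60, 2, 50)])
income = rng.normal(50_000, 10_000, 100)
X = np.column_stack([age, income])

# Put every feature on a comparable scale (mean 0, standard deviation 1)
# so that income does not dominate the distance calculation.
X_scaled = StandardScaler().fit_transform(X)

# Record inertia (within-cluster sum of squared distances) for several k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

A value of k where the inertia stops dropping sharply is a reasonable starting point, but the final choice usually also depends on domain knowledge.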
