## Clustering

- No labels


### About the Module


- Clustering is an unsupervised machine learning methodology for grouping and identifying similar objects, people, or observations.

    
    - We can create a new feature (or predictor) from the cluster ids and use it as an input to an ML model, or use the cluster ids as a target.


- Clustering is often used as a preprocessing or an exploratory step in the data science pipeline so that the cluster that each item is assigned to becomes a feature for a supervised model.


- In this module, you will be introduced to various clustering algorithms and learn why and when to use them. You will learn how to use clustering methods to identify similar groups in Python with Scikit-Learn, and how to apply these clusters further down the pipeline.



### Use Cases

- Text: Document classification, summarization, topic modeling, recommendations


- Geographic: crime zones, housing prices


- Marketing: Customer segmentation, market research


- Anomaly detection: account takeover, security risk, fraud


- Image processing: radiology, security


### Vocabulary


- Euclidean Distance


- Manhattan Distance


- Cosine Similarity


- Sparse vs. Dense Matrix


- Manhattan (Taxicab) vs Euclidean Distance
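The three measures in the vocabulary list can be compared on a pair of small vectors. This is a minimal NumPy sketch; the function names and the sample points are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    # Taxicab distance: sum of the absolute differences along each axis
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores their magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

print(euclidean_distance(p, q))   # 5.0
print(manhattan_distance(p, q))   # 7.0

# Perpendicular vectors have a cosine similarity of 0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, because it follows the grid instead of cutting across it.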

### Data Types

- Input: continuous data, or ordered discrete data at a minimum.


- Output: Integer representing a cluster id.


    - The number itself doesn't mean anything except that observations sharing the same number are most similar to each other. The cluster ids are not ordered or otherwise comparable to one another; they only indicate that the clusters are different.

## Common Clustering Algorithms

### K-Means


- Description


    - One of the most popular clustering algorithms.


    - Stores k centroids that it uses to define clusters.


    - A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.
    
    
    - K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
    
    
    - Python implementation: sklearn.cluster.KMeans
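The two alternating steps in the description can be sketched directly in NumPy. This is a simplified illustration under stated assumptions, not what sklearn.cluster.KMeans does internally: the function name `kmeans_sketch`, the random choice of starting observations, and the toy data are all made up here.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Simplified K-Means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct observations picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # (1) Assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # (2) Move each centroid to the mean of the points assigned to it;
        #     an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random starting points, so only the grouping matters, not the id.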

- Parameters


    - Number of clusters (k, n_clusters in sklearn): The number of clusters to form, which is equal to the number of centroids to generate


    - Number of initializations (n_init): The number of times the algorithm will 'begin', i.e. kick off with different centroid seeds. The run with the best result is kept.


    - Maximum number of iterations (max_iter): If the algorithm doesn't converge earlier, this is the maximum number of times it will loop through re-calculation of the centroids.


    - random_state: Specific to sklearn, this is for 'setting the seed' for reproducibility. When you use any integer as a value here and then re-run with the same value, the algorithm will kick off with the same seed as before, and so start from the same initial centroids and produce the same clusters.
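The parameters above map onto the sklearn.cluster.KMeans constructor as follows. This is a minimal sketch on synthetic data; the three blobs and the specific parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each (made up for this example)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2)),
])

kmeans = KMeans(
    n_clusters=3,      # k: number of clusters / centroids
    n_init=10,         # number of runs with different centroid seeds
    max_iter=300,      # cap on centroid re-calculations per run
    random_state=123,  # fixed seed for reproducibility
)
kmeans.fit(X)

print(kmeans.labels_[:5])        # cluster id for the first five observations
print(kmeans.cluster_centers_)   # one row per centroid
print(kmeans.n_iter_)            # iterations used by the best run

# Re-running with the same random_state gives the same assignments
repeat = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=123).fit(X)
print(np.array_equal(kmeans.labels_, repeat.labels_))
```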

- Pros


1. Performance scales well with the amount of data: for a fixed k, number of features, and number of iterations, the run time is linear in the number of objects, $O(n)$

2. Tends to create tight, compact clusters

3. Centroids are recomputed on every iteration, so an observation or object can move to another cluster as the centroids shift


- Cons


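1. Naive use of the mean value for the cluster center, which makes the centroids sensitive to outliers


2. Performs poorly when the clusters are not roughly circular (spherical in higher dimensions)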


3. Hard to predict what k (the number of clusters) should be


4. Which observations the clustering starts with, i.e. initial seeds, can dramatically affect the results


5. The order of the data can affect the results


6. Results are extremely sensitive to the scale of the data: cluster assignment is based on distance, so features measured in larger units dominate.
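The sensitivity to scale can be seen on a small synthetic example. In this sketch, which uses made-up data, two groups differ only on a feature measured in small units, while a second feature in large units is pure noise. The adjusted Rand index (1.0 means perfect agreement, values near 0 mean chance-level agreement) is used here only as a convenient way to compare the clusters with the known groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # observations per group

# Hypothetical data: the groups differ on the small-unit feature only
small_unit = np.concatenate([rng.normal(1.0, 0.05, n), rng.normal(2.0, 0.05, n)])
large_unit = rng.normal(50_000, 10_000, 2 * n)  # noise, unrelated to the groups
X = np.column_stack([small_unit, large_unit])
true_groups = np.repeat([0, 1], n)

# Without scaling, distances are dominated by the large-unit feature
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With scaling, both features contribute on a comparable footing
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```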


### Notes from DBSCAN lesson

- Clustering here is about density and distance

- DBSCAN decides how many clusters there are, gives us labels, decides which points are outliers, and uses conditions on density to decide on clusters

- Hyperparameters:

    - distance function (e.g. Euclidean)

    - eps: distance (epsilon), the radius of the neighborhood around each point

    - minPts: minimum # of points required within an eps-neighborhood to form a dense region (min_samples in sklearn)

        - core point has at least minPts points (counting itself) within its eps-neighborhood

        - reachable point is not a core point itself, but is in the eps-neighborhood of a core point

        - if neither of the above conditions is met, the point is an outlier
    
- Drawbacks:

    - The clusters can only be interpreted in terms of distance and density in the feature space

- Advantages:
    
    - Better suited to smaller datasets, since it can be computationally expensive on large ones

    - Outlier detection: points that belong to no cluster are labeled -1
    
    - Finding anomalies in more than 2D
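The hyperparameters and the -1 outlier label can be seen with sklearn.cluster.DBSCAN on a tiny dataset. This is a minimal sketch; the points and the values of eps and min_samples are chosen by hand for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of four points and one isolated point (made-up data)
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, 0.0],
])

# eps is the neighborhood radius; min_samples plays the role of minPts
dbscan = DBSCAN(eps=0.5, min_samples=3, metric="euclidean")
labels = dbscan.fit_predict(X)

print(labels)  # [ 0  0  0  0  1  1  1  1 -1]

# DBSCAN chose the number of clusters itself; -1 is not a cluster
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
print(n_clusters)  # 2

# Indices of the core points; the isolated point is not among them
print(dbscan.core_sample_indices_)
```

Note that the number of clusters was never passed in: it falls out of eps and min_samples.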



```python
# this is a cool way to check your current working directory, if you need to
import os
print(os.getcwd())
```

```
/Users/faith/codeup-data-science/ds-methodologies-exercises/clustering
```


### Notes on End-to-End Clustering Example

- You can use clustering to create features to predict a target variable

    - One-hot encode (OHE) the cluster IDs
    
    - Cluster on 30 variables
    
    - 8 Clusters remain
    
    - Find out which clusters add value to your prediction model.
    
        - You want your clusters to be somewhat balanced.
        
        - Use viz to identify clusters that matter to you
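Turning cluster IDs into one-hot encoded features might look like the sketch below. The DataFrame, its column names (x1, x2, x3), and the choice of 4 clusters are hypothetical stand-ins for the 30 variables and 8 clusters in the lesson's example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical feature table (random values, for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])

# Cluster on the chosen variables and store the cluster id per row
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["x1", "x2", "x3"]])

# Check how balanced the clusters are
print(df["cluster"].value_counts())

# One-hot encode the ids, because the numbers themselves carry no order
dummies = pd.get_dummies(df["cluster"], prefix="cluster")
features = pd.concat([df.drop(columns="cluster"), dummies], axis=1)
print(features.columns.tolist())
```

In a real pipeline the clustering would be fit on the train split only, then applied to test with `kmeans.predict`.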

### Notes on Zillow Data for Project

- Audience is your class

- Work with a partner that is assigned

- Choose one of three ways to apply clustering

- Talk about the highlights or discoveries from the data and from the project itself, from your findings. 

    - what you learned as it relates to the domain, to the data, and to data science.
    
- Deliverable: notebook, supporting files, modules separated out for ease of walking the class through your project.

- Clustering:

    - Use the clusters as possible features, model with them.
    
    - Use the clusters for exploration, to make discoveries, to decide on a group of variables to dive into or a group of features that may be a waste of time/resources.
    
    - Use the clusters for your target, an option for binning.
    
- Analysis/Takeaway and Modeling

    - The aim is to predict your target variable (logerror)
    
    - Your model should be able to identify which Zestimates will be the most inaccurate, i.e. find the patterns in the residuals
    
    - You have a continuous variable that you want to predict, in a nutshell
    
- Process:

    - Acquire data
    
    - Prep: handle nulls and outliers, visualize distributions, drop variables
    
    - Split
    
        - Impute here if you are going to. Fit the imputer using the train data only.
        
            - fit learns the mean from train; transform fills missing values with that mean in both train and test
    
    - Scale: Think about how you want to scale the different data types
    
    - Explore: Viz, stats testing, clustering, crosstabs, think about the data types of the variables you are comparing. The dtypes will determine how to viz and what kinds of stats tests to run
    
        - You have to think about aggregating sometimes when the data is too granular to really visualize.
    
- Must haves:

    - Project planning, modules, stats testing of clusters and features, viz of clusters, clusters, model, summary of key drivers of target
    
        - All planning in README: goals, hypothesis, doodles and graph ideas, data dictionary
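The split, impute, and scale steps from the process above can be sketched with sklearn. The tiny DataFrame and its column names (sqft, bedrooms) are invented for illustration and are not the actual Zillow schema; the point is only that the imputer and scaler are fit on train and then applied to both splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with some missing values
df = pd.DataFrame({
    "sqft": [1200.0, np.nan, 1500.0, 2000.0, 1800.0, np.nan, 2200.0, 1600.0],
    "bedrooms": [2.0, 3.0, 3.0, 4.0, np.nan, 2.0, 4.0, 3.0],
})

# Split first, so nothing learned from test leaks into train
train, test = train_test_split(df, test_size=0.25, random_state=123)

# fit learns the column means from train only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)
print(imputer.statistics_)  # the train means that will be imputed

# transform fills missing values in train and test with those train means
train_imputed = pd.DataFrame(imputer.transform(train),
                             columns=train.columns, index=train.index)
test_imputed = pd.DataFrame(imputer.transform(test),
                            columns=test.columns, index=test.index)

# Same pattern for scaling: fit on train, transform both
scaler = StandardScaler().fit(train_imputed)
train_scaled = scaler.transform(train_imputed)
test_scaled = scaler.transform(test_imputed)
print(train_scaled.shape, test_scaled.shape)
```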