# Week 8 Review

# Unsupervised Learning
---

## Intro to Unsupervised Learning
**Define supervised and unsupervised learning**
- Supervised learning has a known target (y) 
- y values are part of our training data
- Unsupervised has no known target (no y)


**Types of problems for unsupervised learning**
- Clustering
    - Creating profiles of customers or other things
- PCA
    - Focus on important information in features 
- Network Analysis
    - Credit card fraud
    - Connections on social media
- Topic Modeling in NLP
    
- Can use unsupervised learning in a supervised learning pipeline


See amazing demonstrations of DBSCAN clustering [here](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)

## Reminder: 
### Don't forget to scale your data before doing K-means or DBSCAN

These are distance-based algorithms.
A feature measured on a much larger scale than the others can dominate the distance calculation and overwhelm them.
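
A minimal sketch of the scaling problem, using made-up numbers (the income and age columns are hypothetical, not from the lecture). It compares Euclidean distances before and after `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is income in dollars, column 1 is age in years.
X = np.array([
    [30_000.0, 25.0],
    [32_000.0, 60.0],
    [90_000.0, 27.0],
])

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled: the 35-year age gap between rows 0 and 1 barely registers
# next to the income difference.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# StandardScaler gives each column mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", d_raw_01, d_raw_02)
print("scaled:", d_scaled_01, d_scaled_02)
```

Before scaling, the distances are driven almost entirely by income. After scaling, both columns contribute on a comparable footing.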

## K-Means

**Understand basic unsupervised clustering problems**
- Group similar points or groups together
- Ride sharing 
- Find items with similar behavior (users, products, voters, etc)
- Market segmentation
- Understand complex systems
- Discover meaningful categories for your data
- Reduce the number of classes by grouping (e.g. bourbons, scotches -> whiskeys)
- Feature reduction
- Pre-processing! Create labels for supervised learning

**Key terms**
- K: number of clusters
- Means: the mean of the points in each cluster
- Centroid: center of a cluster

![clusters](images/kmeans.png)

(source: global lecture)

**Steps for K-Means Clustering**
1. Pick a value for k (the number of clusters to create).
1. Initialize k 'centroids' (starting points) in your data.
1. Create your clusters. Assign each point to the nearest centroid.
1. Make your clusters better. Move each centroid to the center of its cluster.
1. Repeat steps 3-4 until your centroids converge.
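
The steps above can be sketched in plain NumPy. This is a toy illustration only: the function name `kmeans_sketch`, the random initialization, and the synthetic data are assumptions made for the example, and scikit-learn's `KMeans` remains the tool to use in practice.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-Means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids at k distinct, randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
```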



**Evaluation Metrics:**
- Silhouette Score
    - compares how close each point is to its own cluster with how close it is to the nearest other cluster
    - ranges from -1 to 1; high Silhouette = clusters are compact and well separated
    
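```
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is assumed to be an already scaled feature matrix
k = 5
cl = KMeans(n_clusters=k, n_init=10, random_state=42)
cl.fit(X)
inertia = cl.inertia_                   # within-cluster sum of squared distances
sil = silhouette_score(X, cl.labels_)   # between -1 and 1
```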
    
- Inertia
    - sum of squared distances from each point to the centroid of its cluster, added up over all clusters
    - measures the compactness of the clusters (lower value = more compact)
    
**Picking k:**
- Use prior knowledge!
- Look for the elbow in an Inertia scree plot, or the highest value in a Silhouette plot

![image.png](images/scree_plot.png)

Source: Images from global lecture.
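
A hedged sketch of how the numbers behind such a plot could be produced. The data comes from `make_blobs` with four centers, which is an assumption for the demo and not the lecture's dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data generated from 4 centers (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    scores.append((k, km.inertia_, silhouette_score(X_scaled, km.labels_)))

for k, inertia, sil in scores:
    print(f"k={k}  inertia={inertia:8.2f}  silhouette={sil:.3f}")

# Inertia tends to keep shrinking as k grows, so look for the elbow;
# Silhouette has a peak, so look for the highest value.
best_k = max(scores, key=lambda row: row[2])[0]
print("k with highest silhouette:", best_k)
```

Plotting the second and third columns of `scores` against k gives the two curves to inspect.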

## DBSCAN

**What is DBSCAN?**
- Density based clustering

**How does DBSCAN work?**
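1. Choose epsilon (`eps` in scikit-learn): the radius of the neighborhood around a point
1. Choose min_num (`min_samples` in scikit-learn): the minimum number of points required inside that neighborhood
1. Pick a random unvisited starting point
1. Check whether at least the minimum number of points fall inside its neighborhood
    - If yes, start a new cluster, then keep expanding it through the points in the neighborhood
    - If no, mark the point as noise for now and move to another random unvisited point
1. Stop when all points have been checked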

**How does DBSCAN compare to K-Means and Hierarchical Clustering?**
- K-Means needs a k selected up front; DBSCAN does not require choosing the number of clusters
- The required density of the observations can be set (through epsilon and the minimum number of points)
- DBSCAN performs well with odd-shaped clusters, less well when the clusters aren't clearly separated
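
To illustrate the point about odd-shaped clusters, here is a small sketch on scikit-learn's two-moons toy data. The `eps` and `min_samples` values are illustrative assumptions, not tuned recommendations, and the adjusted Rand index is used only because this toy dataset comes with true labels:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: a classic non-spherical toy dataset.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
# eps and min_samples are illustrative values, not tuned recommendations.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means the clustering matches the true moons exactly.
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)
print(f"K-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}")
```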
    
**Implementation**

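```
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# X_scaled is assumed to be the scaled feature matrix
dbscan = DBSCAN(eps=2.3, min_samples=4)
dbscan.fit(X_scaled)
# Noise points get the label -1 and are scored as if they were one cluster
silhouette_score(X_scaled, dbscan.labels_)
```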
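
DBSCAN labels noise points as `-1`, so it is often worth inspecting `labels_` before trusting a score. A sketch under stated assumptions: the blobs, the three far-away outliers, and the parameter values below are all made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled: three blobs plus three far-away outliers.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.4, random_state=1)
outliers = np.array([[20.0, 20.0], [-15.0, 10.0], [10.0, -15.0]])
X_scaled = StandardScaler().fit_transform(np.vstack([X, outliers]))

labels = DBSCAN(eps=0.3, min_samples=4).fit(X_scaled).labels_

n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", n_noise)

# Score only the points that were assigned to a cluster.
mask = labels != -1
if n_clusters >= 2:
    print("silhouette (noise excluded):",
          silhouette_score(X_scaled[mask], labels[mask]))
```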

## Principal Component Analysis

- Rotate the data so the first component (the direction of greatest variance) lies along the X-axis. The second component is at 90 degrees to it (also known as orthogonal, perpendicular). And so on.

**Differentiate between feature elimination and feature extraction.**
- Feature Elimination = Dropping features 
- Feature Extraction = Combine existing features into new ones to reduce the number of features

**Describe the PCA algorithm.**
- Create a series of weights (an eigenvector) for the direction explaining the most variance in the observations (PC1)
- Orthogonal to that first principal component axis, create a new series of weights explaining the most remaining variance (PC2), and so on.
- $PC_j = w_{1,j}X_1 + w_{2,j}X_2 + \dots + w_{n,j}X_n$
- Two assumptions
    - Large variance defines importance
    - Linear relationship
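
A minimal sketch of the algorithm using an eigendecomposition of the covariance matrix, cross-checked against scikit-learn. The data is made up, and scikit-learn itself uses an SVD internally, so this shows the idea and not the library's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up correlated data: 200 observations, 3 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center each feature.
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)
# 3. Eigenvectors are the component weights; eigenvalues are the variances.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project the data onto the components.
Z = X_centered @ eigvecs

explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)

# Cross-check against scikit-learn (the sign of each component may differ).
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)
```

Each column of `eigvecs` holds the weights $w_{1,j}, \dots, w_{n,j}$ for one principal component.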
    


**Use cases for PCA.**
- Too many columns relative to the number of rows
- High dimensionality
- Addressing multicollinearity when fitting a model
- Speed up training and prediction time
- Reduce storage space

**Implement PCA in scikit-learn.**
1. Import `PCA` from `sklearn.decomposition`
1. Scale features (if necessary)
1. Instantiate PCA
1. Fit & transform
    - Only include X because this is unsupervised, no y
1. Review explained variance

```
from sklearn.decomposition import PCA

# X_train is assumed to be the (scaled) training feature matrix
pca = PCA()
Z_train = pca.fit_transform(X_train)
```

**Calculate and interpret proportion of explained variance.**
- Explained variance: the share of the total variance in the data that each component accounts for.
- ```pca.explained_variance_ratio_```
- Cumulative variance: sum the explained variance across components (PC1 + PC2 + …) until a threshold is met (80%, 95%, etc.)

The `n_components` argument of `PCA` takes either the number of components desired (an integer) or a decimal between 0 and 1 for the share of cumulative variance desired.
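
A short sketch of both ways to pick the number of components. The iris dataset is used only because it ships with scikit-learn, and the 95% threshold is an arbitrary example value:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris is used only as a small built-in example dataset.
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative variance reaches 95%.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

# Passing a float asks scikit-learn to make the same choice.
pca_95 = PCA(n_components=0.95).fit(X)
print(n_needed, pca_95.n_components_)
```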

## Missing Data

### Types of missing data

- MCAR (Missing Completely at Random)
    - I'm a sleepy graduate student working in a lab. While pipetting, I reach over to grab my pen and accidentally knock three petri dishes off the desk. I lose all the data I would otherwise have collected from those petri dishes.
        - The data of interest is not systematically different between respondents and nonrespondents.
    
- MAR (Missing at Random) 
    - I work in a lab that contains remote sensors. One sensor broke and thus did not gather information from 6:00 AM to 10:00 AM
        - Conditional on data we have observed, the data of interest is not systematically different between respondents and nonrespondents.
        - In this case, accounting for time can help account for the missingness!
    
    
- NMAR (Not Missing at Random)
    - I administer a survey that contains a question about income. Those who have lower incomes are less likely to answer the question about income.
        - The data of interest are systematically different for respondents and nonrespondents.
        - Whether or not an observation is missing depends on the value of the unobserved data itself!
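
The three mechanisms can be simulated to see how they affect a simple estimate. Everything in this sketch is made up for illustration (the age and income model, the missingness probabilities); it only shows the direction of the effect on the observed mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Made-up population: income loosely increases with age.
age = rng.integers(20, 65, size=n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 10_000, size=n)

# Each mask marks the rows whose income value goes missing.
# MCAR: every row has the same 30% chance.
mcar = rng.random(n) < 0.30
# MAR: the chance depends only on an observed column (younger people skip more).
mar = rng.random(n) < np.where(age < 35, 0.60, 0.10)
# NMAR: the chance depends on the unobserved income itself (low incomes skip more).
nmar = rng.random(n) < np.where(income < np.median(income), 0.60, 0.10)

true_mean = income.mean()
bias = {name: income[~mask].mean() - true_mean
        for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]}
print(bias)

# Under MAR, conditioning on the observed variable removes most of the bias.
young = age < 35
mar_bias_young = income[young & ~mar].mean() - income[young].mean()
print(mar_bias_young)
```

Under MCAR the observed mean stays close to the true mean. Under MAR the raw mean is off, but the mean within an age group is not. Under NMAR there is no observed column to condition on that fixes the problem.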



## Recommender Systems, Cosine Similarity, and Sparse Matrices

- Collaborative-based - Find people who also watched GOT and recommend the other shows they like. Usually determined by ratings.

- Content-based - Recommend shows whose content is similar to the content of GOT.

Use **cosine similarity** to determine which to recommend.

![](https://neo4j.com/docs/graph-algorithms/current/images/cosine-similarity.png)

Source: https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/cosine/

![](https://www.oreilly.com/library/view/mastering-machine-learning/9781785283451/assets/d258ae34-f4f8-4143-b3c2-0cb10f2b82de.png)

Source: [Machine Learning Mastery book](https://www.oreilly.com/library/view/mastering-machine-learning/9781785283451/assets/d258ae34-f4f8-4143-b3c2-0cb10f2b82de.png)

- Cosine similarity is the cosine of the angle between two vectors: 1 means they point in the same direction, 0 means they are orthogonal.

- Euclidean (L2) and Manhattan (L1) are two distance-based metrics that can be applied to vectors.

- Each column is a vector.
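
A small worked example of these metrics on three hand-picked vectors (the vectors are arbitrary, chosen so the results are easy to check by hand):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_ab = cosine(a, b)                   # 1.0 up to floating-point rounding
cos_ac = cosine(a, c)                   # 0.0
l2_ab = float(np.linalg.norm(a - b))    # Euclidean (L2): sqrt(1 + 4 + 9)
l1_ab = float(np.sum(np.abs(a - b)))    # Manhattan (L1): 1 + 2 + 3 = 6
print(cos_ab, cos_ac, l2_ab, l1_ab)

# scikit-learn expects 2-D input with one vector per row.
sk_cos_ab = cosine_similarity([a], [b])[0, 0]
print(sk_cos_ab)
```

Note that `a` and `b` have cosine similarity 1 even though their Euclidean distance is not 0: cosine similarity ignores vector length.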

**Cold start problem** - it is difficult to make recommendations at first, when you don't have many ratings from a new user.

### Sparse matrix 
- makes it efficient to store lots of 0s
- in contrast, the arrays you are used to seeing are **dense**

```
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
```
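
A toy sketch that puts these imports to work. The ratings matrix and the show names other than GOT are invented for the example; a real ratings matrix would be far larger and far sparser.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are users, columns are shows, 0 means "not rated".
shows = ["GOT", "Show B", "Show C", "Show D"]
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
])

# CSR format stores only the nonzero entries.
ratings_sparse = sparse.csr_matrix(ratings)
print(ratings_sparse.nnz, "stored values out of", ratings.size)

# Item-item similarity: transpose so each row is one show's rating vector.
item_sim = cosine_similarity(ratings_sparse.T)

# Most similar show to GOT, excluding GOT itself.
got = shows.index("GOT")
sims = item_sim[got].copy()
sims[got] = -1.0
most_similar = shows[int(np.argmax(sims))]
print("most similar to GOT:", most_similar)
```

`cosine_similarity` accepts the sparse matrix directly, so the ratings never need to be expanded back into a dense array.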

Check out [this CrossValidated answer](https://stats.stackexchange.com/a/235676/198892) on z-score, cosine-similarity, and pearson correlation coefficient to help put a number of pieces together. 😀