# Unsupervised Learning

A fitted k-means model remembers the mean of each cluster, called its "centroid". When new samples are presented to the model, each one is assigned to the cluster whose centroid is closest.
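The nearest-centroid assignment can be checked by hand. The following is a minimal sketch, assuming synthetic 2D blobs as stand-ins for `points` and `new_points` (the real arrays are not shown in these notes); `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-ins for `points` and `new_points`: three well-separated 2D blobs
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
new_points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Distance from every new point to every centroid: shape (n_new, n_clusters)
distances = np.linalg.norm(
    new_points[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :],
    axis=2,
)

# Each new point gets the index of its closest centroid
manual_labels = distances.argmin(axis=1)

# Compare the hand-computed labels with the model's own prediction
print(np.array_equal(manual_labels, model.predict(new_points)))
```

The label numbers themselves are arbitrary: rerunning k-means with a different seed can permute them without changing the clusters.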

Clustering 2D points

From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given the array points from the previous exercise, and also an array new_points.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)


[1 2 0 1 2 1 2 2 2 0 1 2 2 0 0 2 0 0 2 2 0 2 1 2 1 0 2 0 0 1 1 2 2 2 0 1 2
 2 1 2 0 1 1 0 1 2 0 0 2 2 2 2 0 0 1 1 0 0 0 1 1 2 2 2 1 2 0 2 1 0 1 1 1 2
 1 0 0 1 2 0 1 0 1 2 0 2 0 1 2 2 2 1 2 2 1 0 0 0 0 1 2 1 0 0 1 1 2 1 0 0 1
 0 0 0 2 2 2 2 0 0 2 1 2 0 2 1 0 2 0 0 2 0 2 0 1 2 1 1 2 0 1 2 1 1 0 2 2 1
 0 1 0 2 1 0 0 1 0 2 2 0 2 0 0 2 2 1 2 2 0 1 0 1 1 2 1 2 2 1 1 0 1 1 1 0 2
 2 1 0 1 0 0 2 2 2 1 2 2 2 0 0 1 2 1 1 1 0 2 2 2 2 2 2 0 0 2 0 0 0 0 2 0 0
 2 2 1 0 1 1 0 1 0 1 0 2 2 0 2 2 2 0 1 1 0 2 2 0 2 0 0 2 0 0 1 0 1 1 1 2 0
 0 0 1 2 1 0 1 0 0 2 1 1 1 0 2 2 2 1 2 0 0 2 1 1 0 1 1 0 1 2 1 0 0 0 0 2 0
 0 2 2 1]

Inspect your clustering

Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so new_points is an array of points and labels is the array of their cluster labels.


- Import matplotlib.pyplot as plt.
- Assign column 0 of new_points to xs, and column 1 of new_points to ys.
- Make a scatter plot of xs and ys, specifying the c=labels keyword arguments to color the points by their cluster label. Also specify alpha=0.5.
- Compute the coordinates of the centroids using the .cluster_centers_ attribute of model.
- Assign column 0 of centroids to centroids_x, and column 1 of centroids to centroids_y.
- Make a scatter plot of centroids_x and centroids_y, using 'D' (a diamond) as a marker by specifying the marker parameter. Set the size of the markers to be 50 using s=50.


In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x,centroids_y,marker='D',s=50)
plt.show()


## How to measure clustering quality (metrics)

A good clustering has tight clusters, meaning that samples in each cluster are bunched together, not spread out.

- Lower inertia is better
- Not too many clusters
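Inertia is the sum of squared distances from each sample to the centroid of its assigned cluster. The sketch below recomputes it by hand and compares it with `model.inertia_`; the `samples` array here is random placeholder data, not the grain dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
samples = rng.normal(size=(200, 4))  # placeholder data, assumed for illustration

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(samples)

# Centroid assigned to each sample, looked up via the fitted labels
assigned_centroids = model.cluster_centers_[model.labels_]

# Sum of squared distances from each sample to its own centroid
manual_inertia = np.sum((samples - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

Because adding clusters can only lower this sum, inertia alone always favors more clusters, which is why it is balanced against keeping the number of clusters small.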

Using elbow method

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)

How many clusters of grain?

In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

KMeans and PyPlot (plt) have already been imported for you.

This dataset was sourced from the UCI Machine Learning Repository.


- For each of the given values of k, perform the following steps:
    - Create a KMeans instance called model with k clusters.
    - Fit the model to the grain data samples.
    - Append the value of the inertia_ attribute of model to the list inertias.
- The code to plot ks vs inertias has been written for you, so hit submit to see the plot!


In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model=KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()


### Evaluating the grain clustering with a cross-tabulation -- **IMPORTANT**: use this method only when the true labels are known

In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array samples of grain samples, and a list varieties giving the grain variety for each sample. Pandas (pd) and KMeans have already been imported for you.


- Create a KMeans model called model with 3 clusters.
- Use the .fit_predict() method of model to fit it to samples and derive the cluster labels. Using .fit_predict() is the same as using .fit() followed by .predict().
- Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
- Use the pd.crosstab() function on df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. Assign the result to ct.
- Hit submit to see the cross-tabulation!


In [None]:
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

Result:

varieties  Canadian wheat  Kama wheat  Rosa wheat
labels                                           
0                       0           1          60
1                      68           9           0
2                       2          60          10
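One way to summarize such a cross-tabulation in a single number is the fraction of samples that fall in their cluster's majority variety, often called purity. This metric is not part of the original exercise; the sketch below is an illustrative assumption that rebuilds the table from the counts printed above.

```python
import pandas as pd

# Cross-tabulation rebuilt from the counts printed in the result above
ct = pd.DataFrame(
    {
        "Canadian wheat": [0, 68, 2],
        "Kama wheat": [1, 9, 60],
        "Rosa wheat": [60, 0, 10],
    },
    index=pd.Index([0, 1, 2], name="labels"),
)

# Variety with the largest count in each cluster
majority = ct.idxmax(axis=1)

# Purity: samples in their cluster's majority variety, divided by all samples
purity = ct.max(axis=1).sum() / ct.to_numpy().sum()

print(majority)
print(round(purity, 3))
```

A purity close to 1 means each cluster is dominated by one variety. Like the cross-tabulation itself, it can only be computed when the true varieties are known.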

### Transforming features for better clusterings

**Note**

- If the data points are close to the mean, the variance is low.
- If the data points are spread out from the mean, the variance is high.
- Variance can be influenced heavily by outliers (extreme values that differ significantly from the rest of the data).
- Variance is in the square of the units of the original data. For example, if the data is measured in meters, the variance will be in square meters.
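A small numeric sketch of these points, using made-up length measurements (the values are assumptions chosen only for illustration):

```python
import numpy as np

# Hypothetical measurements in meters
lengths_m = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Population variance (NumPy's default, ddof=0), expressed in square meters
variance = np.var(lengths_m)

# Adding a single extreme value inflates the variance dramatically
with_outlier = np.append(lengths_m, 50.0)
variance_with_outlier = np.var(with_outlier)

print(variance)
print(variance_with_outlier)
```

The single outlier raises the variance by more than two orders of magnitude, even though the other five values are unchanged.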

| Feature  | Variance |
|----------|----------|
| Alcohol  | 0.65     |
| Malic_ac | 1.24     |
| ...      | ...      |
| od280    | 0.50     |
| Proline  | 99166.71 |

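Proline has a far higher variance than the other features. Because k-means relies on distances between samples, this one feature dominates the distance computation and therefore the resulting clusters.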
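The mean squared Euclidean distance between two samples is proportional to the sum of the per-feature variances, so each feature's share of the total variance indicates how much it drives the distances. The sketch below uses synthetic two-feature data whose scales are loosely modeled on the table (an assumption, not the wine dataset itself); `variance_share` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one small-scale and one large-scale feature (assumed values)
samples = np.column_stack([
    rng.normal(loc=13.0, scale=0.8, size=500),
    rng.normal(loc=750.0, scale=315.0, size=500),
])

def variance_share(X):
    """Return each feature's share of the total variance."""
    variances = X.var(axis=0)
    return variances / variances.sum()

# Before scaling, the large-scale feature accounts for almost all the variance
share_before = variance_share(samples)

# After standardization, every feature has unit variance, so shares are equal
share_after = variance_share(StandardScaler().fit_transform(samples))

print(share_before)
print(share_after)
```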

Using sklearn StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(samples)
# Output: StandardScaler(copy=True, with_mean=True, with_std=True)

samples_scaled = scaler.transform(samples)

Using Pipeline:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)

pipeline.predict(samples)

Before using StandardScaler:

| Labels  | Barbera  | Barolo | Grignolino |
|---------|----------|--------|------------|
| 0       | 29       | 13     | 20         |
| 1       | 0        | 46     | 1          |
| 2       | 19       | 0      | 50         |

These clusters are poor because each label should correspond to a single variety. For example, a label should contain only, or at least mostly, Barolo samples, but here label 0 mixes all three varieties.


After using StandardScaler:

| Labels  | Barbera  | Barolo | Grignolino |
|---------|----------|--------|------------|
| 0       | 0        | 59     | 3          |
| 1       | 48       | 0      | 3          |
| 2       | 0        | 0      | 65         |

These clusters are better: each label now corresponds almost entirely to a single variety.

**Note** Other preprocessing methods: Normalizer and MaxAbsScaler

MaxAbsScaler divides each feature by its maximum absolute value. It can be used for data with a **non-normal distribution**.

Example:

Original data:
 [[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

Scaled data:
 [[0.14285714 0.25       0.33333333]
 [0.57142857 0.625      0.66666667]
 [1.         1.         1.        ]]
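The scaled values above can be reproduced with scikit-learn's `MaxAbsScaler`, or by dividing each column by its largest absolute value. A minimal sketch using the same 3x3 array:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

# Scale each column so that its maximum absolute value becomes 1
scaled = MaxAbsScaler().fit_transform(data)

# The same result by hand: divide every column by its max absolute value
manual = data / np.abs(data).max(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

Unlike StandardScaler, this does not shift the data, so zeros stay zero.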

To check whether the data is normally distributed, use the **Anderson-Darling test**:

In [None]:
import numpy as np
from scipy.stats import anderson

# Generate some example data (non-normally distributed)
data = np.random.exponential(size=100)

# Perform the Anderson-Darling test
result = anderson(data)

print("Statistic:", result.statistic)
print("Critical Values:", result.critical_values)
print("Significance Levels:", result.significance_level)

# Interpret the result
print("\nInterpretation:")
if result.statistic < result.critical_values[2]:
    print("Data looks normal at the 5% significance level.")
else:
    print("Data does not look normal at the 5% significance level.")


In summary, the main differences between StandardScaler and Normalizer are:

**Purpose**: StandardScaler aims to bring features to a standard scale with mean 0 and standard deviation 1, while Normalizer scales the magnitude of individual samples to 1.

**Application**: StandardScaler is often used when features have different scales, and Normalizer is used when you want to emphasize the direction of data points.

**Effect**: StandardScaler preserves the shape of each feature's distribution while changing its location and scale; Normalizer discards each sample's magnitude and keeps only the ratios between its feature values.

**Direction**: StandardScaler operates column-wise, and Normalizer operates row-wise.

**StandardScaler** is typically applied column-wise, which means each feature is scaled independently of the others.

**Normalizer** is applied row-wise, treating each sample as a **vector** and scaling its magnitude to 1 while preserving the direction.
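The column-wise versus row-wise behavior can be seen on a tiny array. This is a minimal sketch with made-up values; `Normalizer` is used with its default L2 norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Rows 0 and 1 point in the same direction but have different magnitudes
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# Column-wise: each feature ends up with mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Row-wise: each sample is rescaled to unit length (L2 norm)
normalized = Normalizer().fit_transform(X)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(np.linalg.norm(normalized, axis=1))
print(normalized)
```

After normalization, rows 0 and 1 become identical because only their direction is kept.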

In [None]:
# Example of standardization
from numpy import asarray
from sklearn.preprocessing import StandardScaler
# define data
data = asarray([[100, 0.001],
                 [8, 0.05],
                 [50, 0.005],
                 [88, 0.07],
                 [4, 0.1]])
print(data)
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)

[[1.0e+02 1.0e-03]
 [8.0e+00 5.0e-02]
 [5.0e+01 5.0e-03]
 [8.8e+01 7.0e-02]
 [4.0e+00 1.0e-01]]

[[ 1.26398112 -1.16389967]
 [-1.06174414  0.12639634]
 [ 0.         -1.05856939]
 [ 0.96062565  0.65304778]
 [-1.16286263  1.44302493]]

**Note** This assumes the data is approximately normally distributed. After standardization, the first feature's mean moves from 50 to 0, and the second feature's mean moves from 0.0452 to 0. Each feature also ends up with a standard deviation of 1.
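The scaled values can also be derived by hand: subtract each column's mean and divide by its population standard deviation (`ddof=0`, which is what StandardScaler uses). A minimal sketch on the same data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[100, 0.001],
                 [8, 0.05],
                 [50, 0.005],
                 [88, 0.07],
                 [4, 0.1]])

# Per-feature mean and population standard deviation
mean = data.mean(axis=0)
std = data.std(axis=0)

# z-score computed by hand
manual = (data - mean) / std

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(data)

print(mean)
print(np.allclose(manual, scaled))
```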