# Welcome to the Dark Art of Coding:
## Introduction to Machine Learning
k-Means Clustering

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Cover an overview of k-Means Clustering
* Examine code samples that walk us through **The Process™**:
   * Prep the data
   * Choose the model
   * Choose appropriate hyperparameters
   * Fit the model
   * Apply the model
   * Examine the results
* Explore a deep dive into this model
* Review some gotchas that might complicate things
* Review tips related to learning more

# Overview: k-Means Clustering
---

Clustering models are popular machine learning models because they:

* are unsupervised and thus don't require pre-determined labels
* can accommodate multidimensional datasets
* can, for simple cases, be fairly easy to interpret

k-Means Clustering algorithms: 

* look for the arithmetic mean of all points in a cluster to identify the cluster centers
* group points together by identifying the closest cluster center

For this example, we will use the KMeans model. The sklearn.cluster module has a number of clustering models, including:

* AffinityPropagation
* DBSCAN
* KMeans
* MeanShift
* SpectralClustering
* and more...

With this background, let's apply **The Process™** on the KMeans Clustering model.

## Prep the data

We start with a set of standard imports...

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

# NOTE: during the Choose the Model step, we will import the 
#     model we want, but there is no reason you can't import it here.
# from sklearn.cluster import KMeans

### Prep the training Data

As mentioned, a number of data generating functions exist in scikit-learn to help you create datasets that you can use to play with and manipulate the models. For this example, I want to explore one of these data generation functions:

```python
sklearn.datasets.make_blobs
```

These dataset generators produce preformatted `features` matrices and `target` arrays.

This dataset is composed of a `features matrix` of `x` and `y` vectors that can be plotted on a chart and a `target array` of cluster identifiers.

In [None]:
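# make_blobs is imported from sklearn.datasets directly;
# the older samples_generator module path was removed from scikit-learn.
from sklearn.datasets import make_blobs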
X_train, y_true = make_blobs(n_samples=400,
                             centers=4,
                             cluster_std=0.70,
                             random_state=13)

It can be really useful to take a look at the features matrix and target array of the training data. 

* In the raw form
* In a visualization tool

For this dataset, let's use a scatter plot.

In [None]:
X_train.shape

In [None]:
X_train[:5]

In [None]:
y_true[:5]

In [None]:
plt.scatter(X_train[:, 0], X_train[:, 1])
plt.title("Four clusters");

### Prep the test data

In this case, the `X_train` data will also be our testing data. We know which cluster categories were assigned by the `make_blobs` function, but we want to see if the KMeans model will correctly identify them.

## Choose the Model

In [None]:
from sklearn.cluster import KMeans

## Choose Appropriate Hyperparameters

Here we choose to assign one hyperparameter: `n_clusters`. We will discuss it later.

In [None]:
model = KMeans(n_clusters=4)

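There are a number of other hyperparameters; we will cover several in greater depth later. The signature below reflects an older scikit-learn release. Newer releases removed `precompute_distances` and `n_jobs` and changed the defaults for `n_init` and `algorithm`, so check the API reference for your installed version.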

```python
KMeans(
    n_clusters=8,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    precompute_distances='auto',
    verbose=0,
    random_state=None,
    copy_x=True,
    n_jobs=None,
    algorithm='auto',
)
```

## Fit the Model

This model doesn't need OR use any labels, so we simply feed in the `X_train` data.

In [None]:
model.fit(X_train)

## Apply the Model

In [None]:
y_pred = model.predict(X_train)

In [None]:
y_pred.shape

In [None]:
y_pred[:5]

# For comparison's sake, y_true[:5] is array([1, 3, 2, 1, 3]).
# The cluster numbers themselves won't necessarily match...
#     but points grouped together in y_true should mostly be grouped together here.
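Because cluster numbers are arbitrary, comparing `y_true` and `y_pred` element by element can be misleading. One way to compare the groupings themselves is the adjusted Rand index. The following is a minimal sketch, assuming `adjusted_rand_score` from `sklearn.metrics` (not used elsewhere in this lesson); the `_demo` names, `n_init=10`, and `random_state=0` are illustrative choices made only so the cell runs on its own and is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Rebuild the same dataset used above so this cell is self-contained.
X_demo, y_demo_true = make_blobs(n_samples=400, centers=4,
                                 cluster_std=0.70, random_state=13)

# n_init and random_state are assumptions, set only for repeatability.
demo_model = KMeans(n_clusters=4, n_init=10, random_state=0)
y_demo_pred = demo_model.fit_predict(X_demo)

# The score ignores how the clusters are numbered:
# 1.0 means identical groupings, values near 0.0 mean chance-level agreement.
ari = adjusted_rand_score(y_demo_true, y_demo_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to `1.0` suggests the model recovered nearly the same groups that `make_blobs` generated, even if the numeric labels differ.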

## Examine the results

In [None]:
plt.scatter(X_train[:, 0], X_train[:, 1],
            c=y_pred,
            cmap='seismic', alpha=0.2)

In [None]:
ctrs = model.cluster_centers_

In [None]:
plt.scatter(X_train[:, 0], X_train[:, 1],
            c=y_pred,
            cmap='seismic', alpha=0.2)

plt.scatter(ctrs[:, 0], ctrs[:, 1],
            c='white',
            edgecolors='black',
            s=150,
            )

# Gotchas
---

The k-Means Clustering model works based on a process called Expectation-Maximization. In this process, the model:

* starts by randomly picking some cluster centers
* repeats the following cycle until the model converges
    * Expectation: assign points to the closest cluster center
    * Maximization: set new cluster centers to the mean of the points in the cluster

The process is designed so that each cycle of the Expectation and Maximization steps never makes the fit worse: the total squared distance from the points to their assigned cluster centers either decreases or stays the same.
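The cycle described above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's actual implementation: the starting centers are simply four randomly chosen data points, the loop cap of 100 is arbitrary, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_em, _ = make_blobs(n_samples=400, centers=4,
                     cluster_std=0.70, random_state=13)

rng = np.random.default_rng(0)
k = 4
# Start by randomly picking k data points as the initial cluster centers.
centers = X_em[rng.choice(len(X_em), size=k, replace=False)]

inertia_history = []
for _ in range(100):
    # Expectation: assign each point to its closest cluster center.
    dists = np.linalg.norm(X_em[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Track the total squared distance to the assigned centers.
    closest = dists[np.arange(len(X_em)), labels]
    inertia_history.append(float((closest ** 2).sum()))

    # Maximization: move each center to the mean of its assigned points
    # (an empty cluster keeps its previous center).
    new_centers = np.array([
        X_em[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

    # Stop once the centers no longer move.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(inertia_history)
```

The recorded values in `inertia_history` should never increase from one cycle to the next, which is the convergence behavior the steps rely on.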

**No global guarantees**: despite the promise of convergence, there is no guarantee that the clusters produced will be the best possible clusters overall.

The result depends on the randomly selected initial cluster centers. To overcome this limitation, the model typically runs the algorithm multiple times and keeps the best result. In the release this lesson was written against, the default `n_init` is `10`; newer scikit-learn releases default to `'auto'`.
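To get a feel for how much the starting centers matter, the sketch below runs the algorithm once per seed with purely random starting centers and records each run's `inertia_` (the total squared distance from points to their assigned centers). The choice of `init='random'`, `n_init=1`, and seeds `0` through `9` is an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_init, _ = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.70, random_state=13)

# One single-start run per seed, each with random starting centers.
single_run_inertias = []
for seed in range(10):
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed)
    km.fit(X_init)
    single_run_inertias.append(km.inertia_)

print("worst single run:", max(single_run_inertias))
print("best single run: ", min(single_run_inertias))
```

If the worst and best values differ noticeably, some starts ended in a poorer local solution. Setting `n_init` above `1` asks the model to do this kind of repetition internally and keep the run with the lowest inertia.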

**You must decide on the number of clusters**: when we set the hyperparameters, we need to initialize the model with the right number of clusters. The default `n_clusters` is set at `8`.

* There are other models that may provide some measure of the fitness of the number of clusters: `GaussianMixture`
* There are other models that can choose a suitable number of clusters: `DBSCAN`, `MeanShift`
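As a rough sketch of both ideas, the cell below scores several candidate cluster counts with `GaussianMixture` and, separately, lets `MeanShift` settle on a count by itself. Using BIC as the fitness measure, the candidate range of 1 to 7, and the default `MeanShift` bandwidth are all assumptions made for illustration; other criteria and settings may suit your data better.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_k, _ = make_blobs(n_samples=400, centers=4,
                    cluster_std=0.70, random_state=13)

# GaussianMixture: fit one model per candidate count and compare BIC
# (lower BIC suggests a better trade-off between fit and complexity).
bic_by_k = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_k)
    bic_by_k[k] = gmm.bic(X_k)
best_k = min(bic_by_k, key=bic_by_k.get)
print("count with lowest BIC:", best_k)

# MeanShift: the number of clusters is an outcome, not an input.
ms = MeanShift().fit(X_k)
n_found = len(np.unique(ms.labels_))
print("clusters found by MeanShift:", n_found)
```

Neither number is guaranteed to match the four blobs that were generated; both are hints to weigh alongside a plot of the data.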

**Speed considerations**: k-Means can be slow on large datasets, because every cycle measures the distance from every point to every cluster center.
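One option that is sometimes used for larger datasets is `MiniBatchKMeans`, also in `sklearn.cluster`, which updates the centers from small random batches instead of the full dataset on every cycle. It is not covered elsewhere in this lesson, so treat the following as a hedged sketch: the dataset size, `batch_size=1024`, `n_init=3`, and `random_state=0` are arbitrary illustrative choices.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger version of the blob dataset, generated only for this sketch.
X_big, _ = make_blobs(n_samples=50_000, centers=4,
                      cluster_std=0.70, random_state=13)

# Each update uses a random batch of points rather than all 50,000.
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024,
                      n_init=3, random_state=0)
mbk.fit(X_big)

print(mbk.cluster_centers_)
```

The trade-off is that the mini-batch result is usually a little less precise than a full `KMeans` fit, so it is worth comparing the two on a sample of your data.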

# Deep Dive
---

# How to learn more: tips and hints
---

# Experience Points!
---

# delete_this_line: task 01

In **`jupyter`** create a simple script to complete the following tasks:


**REPLACE THE FOLLOWING**

Create a function called `me()` that prints out 3 things:

* Your name
* Your favorite food
* Your favorite color

Lastly, call the function, so that it executes when the script is run

---
When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Experience Points!
---

# delete_this_line: task 02

In **`jupyter`** create a simple script to complete the following tasks:

**REPLACE THE FOLLOWING**

Task | Sample Object(s)
:---|:---
Compare two items using `and` | 'Bruce', 0
Compare two items using `or` | '', 42
Use the `not` operator to make an object False | 'Selina' 
Compare two numbers using comparison operators | `>, <, >=, !=, ==`
Create a more complex/nested comparison using parentheses and Boolean operators | `('kara' _ 'clark') _ (0 _ 0.0)`

---
When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Experience Points!
---

# delete_this_line: sample 03

In your **text editor** create a simple script called:

```bash
my_lessonname_03.py
```

Execute your script on the command line using **`ipython`** via this command:

```bash
ipython -i my_lessonname_03.py
```

**REPLACE THE FOLLOWING**

I suggest that, as you add each feature to your script, you run it right away to test it incrementally.

1. Create a variable with your first name as a string AND save it with the label: `myfname`.
1. Create a variable with your age as an integer AND save it with the label: `myage`.

1. Use `input()` to prompt for your first name AND save it with the label: `fname`.
1. Create an `if` statement to test whether `fname` is equivalent to `myfname`. 
1. In the `if` code block: 
   1. Use `input()` to prompt for your age AND save it with the label: `age`
   1. NOTE: don't forget to convert the value to an integer.
   1. Create a nested `if` statement to test whether `myage` and `age` are equivalent.
1. If both tests pass, have the script print: `Your identity has been verified`

When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# References
---

Below are references that may assist you in learning more:
    
|Title (link)|Comments|
|---|---|
|[General API Reference](https://scikit-learn.org/stable/modules/classes.html)||
|[KMeans API Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)||
|[User Guide](https://scikit-learn.org/stable/modules/clustering.html#k-means)||
|[Sample datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets)|Load or create datasets for practice and study|
|[Make blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs)|Specifically make clusters of values|