# Clustering and mixture models


The following notebook shows how PyBDA can be used for a simple clustering task. We use the `iris` data because clustering it is fairly easy, so it serves as a check that everything worked. We fit a $k$-means clustering and compare it to a Gaussian mixture model (GMM). $k$-means makes fairly stringent assumptions about the data, namely spherical clusters of equal variance, while the GMM estimates the covariances from the data.
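
To make that contrast concrete, the two models can be written side by side. This is the textbook formulation, not something taken from PyBDA's source, and the symbols are introduced here for illustration only: $x_i \in \mathbb{R}^d$ are the $n$ data points, $\mu_k$ the centers, $\pi_k$ the mixing weights and $\Sigma_k$ the covariances. $k$-means minimizes the within-cluster sum of squares,

$$
\min_{\mu_1,\dots,\mu_K} \sum_{i=1}^{n} \min_{k} \lVert x_i - \mu_k \rVert^2 ,
$$

while a GMM models the density as

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) .
$$

If we fix $\pi_k = 1/K$ and $\Sigma_k = \sigma^2 I$ and assign every point to exactly one component $z_i$, the complete-data log-likelihood becomes

$$
\sum_{i=1}^{n} \log\left( \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \sigma^2 I) \right)
= -n \log K - \frac{nd}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \lVert x_i - \mu_{z_i} \rVert^2 ,
$$

where the $\pi$ inside $\log(2\pi\sigma^2)$ is the usual constant. Only the last term depends on $z_i$ and $\mu_k$, so maximizing this expression is the same as minimizing the $k$-means objective. In that sense $k$-means behaves like a GMM restricted to equal weights and a shared spherical covariance.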

We start by activating our environment:

In [1]:
source ~/miniconda3/bin/activate pybda

To run both clusterings, we only need a short config file that lists the two method names. A suitable file is already provided in the `data` folder:

In [2]:
cd data

In [3]:
cat pybda-usecase-clustering.config

spark: spark-submit
infile: iris.tsv
outfolder: results
meta: iris_meta_columns.tsv
features: iris_feature_columns.tsv
clustering: kmeans,gmm
n_centers: 3, 5, 10
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true

The config above does the following:

* fit a $k$-means clustering and a GMM with 3, 5 and 10 cluster centers/components each, using the features listed in `iris_feature_columns.tsv`,
* give the Spark driver 1G of memory and the executor 1G of memory,
* write the results to `results`,
* print debug information.

This totals six model fits (two methods times three values of `n_centers`) with minimal coding effort.
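
As a quick check of that count, the combinations can be enumerated in the shell. This is only an illustrative sketch: `methods` and `centers` are hypothetical variable names holding values copied by hand from the `clustering` and `n_centers` lines of the config. PyBDA parses the config itself, so nothing like this is needed to run it.

```shell
# Values copied by hand from the `clustering` and `n_centers` config lines.
methods="kmeans,gmm"
centers="3, 5, 10"

runs=0
# Turn the commas into spaces so the shell splits the lists into words.
for method in $(echo "$methods" | tr ',' ' '); do
  for k in $(echo "$centers" | tr ',' ' '); do
    echo "$method with k=$k"
    runs=$((runs + 1))
  done
done
echo "total fits: $runs"
```

Note that the Snakemake output printed by the next cell lists only two jobs (`gmm` and `kmeans`): each job fits all three values of `n_centers`.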

Having the parameters set, we can call PyBDA:

In [4]:
pybda clustering pybda-usecase-clustering.config local | head -n 10

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	gmm
	1	kmeans
	2
	count	jobs
	1	gmm
	1	kmeans
	2

[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: 
[Thu Aug  8 22:22:30 2019]
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: [Thu Aug  8 22:22:30 2019]
rule gmm:
    input: iris.tsv
    output: results/2019_08_08/gmm_from_iris, results/2019_08_08/gmm_from_iris-profile.png, results/2019_08_08/gmm_from_iris-profile.pdf, results/2019_08_08/gmm_from_iris-profile.eps, results/2019_08_08/gmm_from_iris-profile.svg, results/2019_08_08/gmm_from_iris-profile.tsv, results/2019_08_08/gmm_from_iris-transformed-K3-components, results/2019_08_08/gmm_from_iris-transformed-K5-components, results/2019_08_08/gmm_from_iris-transformed-K10-components
    jobid: 0
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: rule gmm:
    input: 

The call automatically executes the jobs defined in the config. After both have run, we should check the plots and statistics. Let's see what we got:

In [5]:
cd results
ls -lgG *

total 3196
drwxrwxr-x 5    4096 Aug  8 22:22 gmm_from_iris
-rw-rw-r-- 1   41630 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.eps
-rw-rw-r-- 1   11913 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.pdf
-rw-rw-r-- 1   94191 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.png
-rw-rw-r-- 1   54900 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.svg
-rw-rw-r-- 1    6332 Aug  8 22:22 gmm_from_iris.log
-rw-rw-r-- 1   22803 Aug  8 22:22 gmm_from_iris-profile.eps
-rw-rw-r-- 1   13040 Aug  8 22:22 gmm_from_iris-profile.pdf
-rw-rw-r-- 1  218232 Aug  8 22:22 gmm_from_iris-profile.png
-rw-rw-r-- 1   31619 Aug  8 22:22 gmm_from_iris-profile.svg
-rw-rw-r-- 1     137 Aug  8 22:22 gmm_from_iris-profile.tsv
-rw-rw-r-- 1 1288436 Aug  8 22:22 gmm_from_iris-spark.log
drwxrwxr-x 2    4096 Aug  8 22:22 gmm_from_iris-transformed-K10-components
drwxrwxr-x 2    4096 Aug  8 22:22 gmm_from_iris-transformed-

Finally, let's check how many clusters/components are recommended for each method:

In [6]:
cat */kmeans_from_iris-profile.tsv

k	within_cluster_variance	explained_variance	total_variance	BIC	
3	78.85566447695781	0.8842690519576577	681.3705911067716	143.99392330020913	
5	46.71230004910447	0.931443621637341	681.3705911067716	151.93564122512583	
10	32.49992444218044	0.9523021321049536	681.3705911067716	237.93597150012693	

In [7]:
cat */gmm_from_iris-profile.tsv

k	loglik	BIC	
3	-189.53852954643244	599.5450120331002	
5	-154.60572954901806	679.998470861159	
10	-50.516420858075136	847.6175005364923	
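
The `BIC` column above appears consistent with the usual definition $\mathrm{BIC} = -2 \cdot \text{loglik} + p \ln n$, where $p$ is the number of free parameters. The sketch below recomputes it under assumptions that are not taken from PyBDA's source: $n = 150$ observations and $d = 4$ features (the standard `iris` data) and full covariance matrices, so that $p = (k - 1) + kd + kd(d+1)/2$. The helper `bic` is a hypothetical name used only here.

```shell
# Assumptions: iris has n = 150 rows and d = 4 features,
# and every GMM component has a full covariance matrix.
n=150
d=4

# bic K LOGLIK prints -2 * loglik + p * ln(n), where
# p = (K - 1) mixing weights + K * d means + K * d * (d + 1) / 2 covariance entries.
bic() {
  LC_ALL=C awk -v k="$1" -v ll="$2" -v n="$n" -v d="$d" 'BEGIN {
    p = (k - 1) + k * d + k * d * (d + 1) / 2
    printf "%.3f\n", -2 * ll + p * log(n)
  }'
}

# Log-likelihoods copied from the table above.
bic3=$(bic 3 -189.53852954643244)
bic5=$(bic 5 -154.60572954901806)
bic10=$(bic 10 -50.516420858075136)
echo "k=3: $bic3  k=5: $bic5  k=10: $bic10"
```

Under these assumptions the three results agree with the reported column to three decimals. The log-likelihood improves steadily with $k$, but the penalty term grows faster, which is why the smallest model ends up with the lowest BIC.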

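In both cases the BIC is lowest at $k = 3$, so three clusters/components are recommended, just as expected from the `iris` data. Nice! Note that the within-cluster variance and the log-likelihood alone keep improving as $k$ grows, so they cannot be used to pick $k$ without a penalty such as the BIC.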
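
Reading off the minimum by eye works for three rows; for longer profiles the same selection can be scripted. Here is a minimal sketch with `awk`. To keep it self-contained it writes the $k$-means profile printed above to a temporary file; on a real run you would point it at the `*-profile.tsv` file instead. It assumes, as the tables suggest, that the column is named `BIC` and that lower is better.

```shell
# Temporary copy of the k-means profile shown above (tab-separated).
profile=$(mktemp)
printf 'k\twithin_cluster_variance\texplained_variance\ttotal_variance\tBIC\n' > "$profile"
printf '3\t78.85566447695781\t0.8842690519576577\t681.3705911067716\t143.99392330020913\n' >> "$profile"
printf '5\t46.71230004910447\t0.931443621637341\t681.3705911067716\t151.93564122512583\n' >> "$profile"
printf '10\t32.49992444218044\t0.9523021321049536\t681.3705911067716\t237.93597150012693\n' >> "$profile"

# Locate the BIC column by its header name, then keep the row with the smallest value.
best_k=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "BIC") col = i; next }
  best == "" || $col + 0 < min { min = $col + 0; best = $1 }
  END { print best }
' "$profile")

echo "lowest BIC at k=$best_k"
rm -f "$profile"
```

Looking the column up by name rather than by position means the same command also works for the GMM profile, where `BIC` is the third column.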

There are plenty of other files and plots to check out, though. For instance, we should _always_ look at the `log` files to check the parameters and what was actually computed:

In [8]:
cat */gmm_from_iris.log

[2019-08-08 22:22:33,019 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.name, value: gmm.py
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 37957
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565295753672
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client
[2019-08-08 22:22:34,089 - 

Furthermore, the Spark `log` file is worth a look when a method fails:

In [10]:
head */gmm_from_iris-spark.log

2019-08-08 22:22:31 WARN  Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-08 22:22:31 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-08 22:22:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-08 22:22:33 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-08 22:22:33 INFO  SparkContext:54 - Submitted application: gmm.py
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing view acls to: simon
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing modify acls to: simon
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-08-08 22:22:33 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(simon);
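
The Spark log is long (about 1.3 MB in the listing above), so filtering by log level is usually quicker than reading it top to bottom. The sketch below uses a few lines copied from the output above as sample input, so that it runs on its own. It assumes the level (`INFO`, `WARN`, `ERROR`) is the third whitespace-separated field, as it is in those lines.

```shell
# Sample input: four lines copied from the Spark log shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
2019-08-08 22:22:31 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-08 22:22:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-08 22:22:33 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-08 22:22:33 INFO  SparkContext:54 - Submitted application: gmm.py
EOF

# Count the lines per log level (third field).
levels=$(awk '{ n[$3]++ } END { for (l in n) print l, n[l] }' "$log" | sort)
echo "$levels"

# Count and show only warnings and errors.
problems=$(grep -cE ' (WARN|ERROR) ' "$log")
echo "warnings/errors: $problems"
grep -E ' (WARN|ERROR) ' "$log"

rm -f "$log"
```

On a real run, replace the temporary file with the path to `gmm_from_iris-spark.log`. The two warnings in this sample (loopback address, native Hadoop library) were logged during the run shown above, which still produced results.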
