# Dimension reduction


Here, we demonstrate how PyBDA can be used for dimension reduction. We use the `iris` data because we know _how_ we want the different plants to be clustered. We'll use PCA, factor analysis and LDA for the dimension reduction and embed the data into a two-dimensional space.
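
To make "embedding into a two-dimensional space" concrete, here is a minimal PCA sketch in plain Python. It is only an illustration and not how PyBDA computes its embeddings (PyBDA runs on Spark). The sample rows are hypothetical iris-like measurements, and the helper names (`center`, `covariance`, `leading_eigenpair`, `pca_embed`) are made up for this example. The sketch centers the data, estimates the covariance matrix, and finds the two leading eigenvectors by power iteration with deflation.

```python
import math
import random

# Hypothetical iris-like rows: sepal length, sepal width, petal length, petal width.
ROWS = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
]


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def leading_eigenpair(matrix, iterations=500):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    vector = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(vector[i] * product[i] for i in range(d))
    return eigenvalue, vector


def pca_embed(rows, n_components=2):
    """Project centered rows onto the leading principal components."""
    centered = center(rows)
    matrix = covariance(centered)
    d = len(matrix)
    components = []
    for _ in range(n_components):
        eigenvalue, vector = leading_eigenpair(matrix)
        components.append(vector)
        # Deflation: remove the component just found before searching for the next.
        matrix = [[matrix[i][j] - eigenvalue * vector[i] * vector[j]
                   for j in range(d)] for i in range(d)]
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in centered]


embedding = pca_embed(ROWS, n_components=2)
for point in embedding:
    print("\t".join(f"{value:.3f}" for value in point))
```

Each printed row is one plant in the two-dimensional space. The first coordinate carries the most variance, the second the most of what remains.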

We activate our environment first:

In [1]:
source ~/miniconda3/bin/activate pybda


The `data` folder already contains an example of how dimension reduction can be used. It is fairly simple:

In [2]:
cd data


In [3]:
cat pybda-usecase-dimred.config

spark: spark-submit
infile: iris.tsv
outfolder: results
meta: iris_meta_columns.tsv
features: iris_feature_columns.tsv
dimension_reduction: pca, factor_analysis, lda
n_components: 2
response: Species
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true

In the config above we will do the following:

* run three dimension reductions (PCA, factor analysis and LDA), each to two dimensions, on the features in `iris_feature_columns.tsv`,
* for the LDA use the response variable `Species`,
* give the Spark driver 1G of memory and the executor 1G of memory,
* write the results to `results`,
* print debug information.
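
The mapping from config keys to the steps above can be sketched in a few lines of Python. This is an illustration only: `parse_simple_config` is a hypothetical helper that handles just the flat `key: value` lines and `- item` lists seen in this file, and PyBDA itself may well read the config differently.

```python
CONFIG_TEXT = """\
spark: spark-submit
infile: iris.tsv
outfolder: results
meta: iris_meta_columns.tsv
features: iris_feature_columns.tsv
dimension_reduction: pca, factor_analysis, lda
n_components: 2
response: Species
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
"""


def parse_simple_config(text):
    """Parse flat `key: value` lines and `- item` lists (not a full YAML parser)."""
    config = {}
    current_list = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- ") and current_list is not None:
            # List item belonging to the most recent key without a value.
            current_list.append(line[2:].strip().strip('"'))
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            config[key.strip()] = value
            current_list = None
        else:
            current_list = []
            config[key.strip()] = current_list
    return config


config = parse_simple_config(CONFIG_TEXT)
methods = [name.strip() for name in config["dimension_reduction"].split(",")]
n_components = int(config["n_components"])

print("methods:", methods)
print("target dimensions:", n_components)
print("response for LDA:", config["response"])
print("Spark parameters:", config["sparkparams"])
```

The comma-separated `dimension_reduction` entry is what turns a single config file into three separate jobs.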

As can be seen, the effort to implement the three embeddings is minimal.

We execute PyBDA like this:

In [4]:
pybda dimension-reduction pybda-usecase-dimred.config local

Checking command line arguments for method: dimension_reduction
Printing rule tree:
 -> _ (, iris.tsv)
	 -> dimension_reduction (iris.tsv, results/2019_08_08/lda_from_iris.tsv)
	 -> dimension_reduction (iris.tsv, results/2019_08_08/factor_analysis_from_iris.tsv)
	 -> dimension_reduction (iris.tsv, results/2019_08_08/pca_from_iris.tsv)

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	factor_analysis
	1	lda
	1	pca
	3

[2019-08-08 23:07:17,268 - INFO - snakemake.logging]: 
[Thu Aug  8 23:07:17 2019]
[2019-08-08 23:07:17,268 - INFO - snakemake.logging]: [Thu Aug  8 23:07:17 2019]
rule lda:
    input: iris.tsv
    output: results/2019_08_08/lda.tsv, results/2019_08_08/lda-projection.tsv, results/2019_08_08/lda-plot
    jobid: 0
[2019-08-08 23:07:17,268 - INFO - snakemak


After the three methods have run, we should check the plots and statistics. Let's see what we got:

In [5]:
cd results
ls -lgG *

total 840
-rw-rw-r-- 1    190 Aug  8 23:09 factor_analysis_from_iris-loadings.tsv
-rw-rw-r-- 1   4881 Aug  8 23:09 factor_analysis_from_iris.log
-rw-rw-r-- 1    483 Aug  8 23:09 factor_analysis_from_iris-loglik.tsv
drwxrwxr-x 2   4096 Aug  8 23:09 factor_analysis_from_iris-plot
-rw-rw-r-- 1 319409 Aug  8 23:09 factor_analysis_from_iris-spark.log
-rw-r--r-- 1  12780 Aug  8 23:09 factor_analysis_from_iris.tsv
-rw-rw-r-- 1   2812 Aug  8 23:07 lda.log
drwxrwxr-x 2   4096 Aug  8 23:07 lda-plot
-rw-rw-r-- 1    346 Aug  8 23:07 lda-projection.tsv
-rw-rw-r-- 1 343222 Aug  8 23:07 ldaspark.log
-rw-r--r-- 1  12541 Aug  8 23:07 lda.tsv
-rw-rw-r-- 1    348 Aug  8 23:08 pca_from_iris-loadings.tsv
-rw-rw-r-- 1   2987 Aug  8 23:08 pca_from_iris.log
drwxrwxr-x 2   4096 Aug  8 23:08 pca_from_iris-plot
-rw-rw-r-- 1 101682 Aug  8 23:08 pca_from_iris-spark.log
-rw-r--r-- 1  12749 Aug  8 23:08 pca_from_iris.tsv

It is interesting to compare the different embeddings. Since we cannot open the plots from the command line, we load pre-computed plots.

First, the embedding of the *PCA*:

<img src="_static/examples/pca.svg" width="500"/>

The embedding of the *factor analysis*:

<img src="_static/examples/fa.svg" width="500"/>

Finally, the embedding of the *LDA*. Since LDA needs a response variable to work, we include this information when we create the plot:

<img src="_static/examples/lda.svg" width="500"/>
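
Why LDA needs the response can be seen from the standard Fisher formulation, sketched below. This is the textbook criterion and PyBDA's implementation may differ in its details. Here $x_i$ are the feature vectors, $y_i$ the class labels (`Species`), $\mu_k$ and $n_k$ the mean and size of class $k$, and $\mu$ the overall mean.

```latex
S_W = \sum_{k=1}^{K} \sum_{i \,:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top,
\qquad
S_B = \sum_{k=1}^{K} n_k \, (\mu_k - \mu)(\mu_k - \mu)^\top

W^\ast = \arg\max_{W} \frac{\left| W^\top S_B W \right|}{\left| W^\top S_W W \right|}
```

Both scatter matrices are defined through the class labels, so without a response there is nothing to optimize. The columns of $W^\ast$ are the leading eigenvectors of $S_W^{-1} S_B$, and since $S_B$ has rank at most $K - 1$, three species yield at most two discriminant directions, which matches `n_components: 2`.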

PyBDA creates many other files and plots. It is, for instance, always important to look at `log` files:

In [6]:
head */pca_from_iris.log

[2019-08-08 23:08:01,888 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-08 23:08:02,889 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 42629
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565298482500
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.driver.host, value: 192.168.1.33
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client
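
When checking whether the settings from the config actually reached Spark, it can help to pull the `Config:` entries out of the PyBDA log. The following is a small sketch that assumes the log line layout shown above. The sample lines are copied from that output, and `spark_settings` is a hypothetical helper, not part of PyBDA.

```python
import re

# A few lines copied from the log output above.
LOG_LINES = [
    "[2019-08-08 23:08:01,888 - INFO - pybda.spark_session]: Initializing pyspark session",
    "[2019-08-08 23:08:02,889 - INFO - pybda.spark_session]: Config: spark.master, value: local",
    "[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G",
    "[2019-08-08 23:08:02,890 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client",
]

CONFIG_PATTERN = re.compile(r"Config: (?P<key>[^,]+), value: (?P<value>.*)$")


def spark_settings(lines):
    """Collect `Config: <key>, value: <value>` entries into a dictionary."""
    settings = {}
    for line in lines:
        match = CONFIG_PATTERN.search(line.rstrip("\n"))
        if match:
            settings[match.group("key")] = match.group("value").strip()
    return settings


settings = spark_settings(LOG_LINES)
for key, value in sorted(settings.items()):
    print(f"{key} = {value}")
```

To run this on a real log, pass an open file handle instead of `LOG_LINES`, for instance the `pca_from_iris.log` file in the results folder.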

Furthermore, the Spark `log` file is worth inspecting when a method fails:

In [7]:
cat */pca_from_iris-spark.log

2019-08-08 23:08:00 WARN  Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-08 23:08:00 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-08 23:08:00 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-08 23:08:01 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-08 23:08:01 INFO  SparkContext:54 - Submitted application: pca.py
2019-08-08 23:08:01 INFO  SecurityManager:54 - Changing view acls to: simon
2019-08-08 23:08:01 INFO  SecurityManager:54 - Changing modify acls to: simon
2019-08-08 23:08:01 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-08-08 23:08:01 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-08-08 23:08:01 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(simon);

2019-08-08 23:08:05 INFO  CodeGenerator:54 - Code generated in 15.579257 ms
2019-08-08 23:08:05 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 285.0 KB, free 366.0 MB)
2019-08-08 23:08:05 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 366.0 MB)
2019-08-08 23:08:05 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 192.168.1.33:36883 (size: 23.4 KB, free: 366.3 MB)
2019-08-08 23:08:05 INFO  SparkContext:54 - Created broadcast 0 from csv at NativeMethodAccessorImpl.java:0
2019-08-08 23:08:05 INFO  FileSourceScanExec:54 - Planning scan with bin packing, max size: 4201044 bytes, open cost is considered as scanning 4194304 bytes.
2019-08-08 23:08:05 INFO  SparkContext:54 - Starting job: csv at NativeMethodAccessorImpl.java:0
2019-08-08 23:08:05 INFO  DAGScheduler:54 - Got job 0 (csv at NativeMethodAccessorImpl.java:0) with 1 output partitions
2019-08-08 23:08:05 INFO  DAGScheduler:54

2019-08-08 23:08:06 INFO  SparkContext:54 - Created broadcast 4 from broadcast at DAGScheduler.scala:1161
2019-08-08 23:08:06 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[15] at treeAggregate at RowMatrix.scala:419) (first 15 tasks are for partitions Vector(0))
2019-08-08 23:08:06 INFO  TaskSchedulerImpl:54 - Adding task set 1.0 with 1 tasks
2019-08-08 23:08:06 INFO  TaskSetManager:54 - Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 8325 bytes)
2019-08-08 23:08:06 INFO  Executor:54 - Running task 0.0 in stage 1.0 (TID 1)
2019-08-08 23:08:07 INFO  FileScanRDD:54 - Reading File path: file:///home/simon/PROJECTS/pybda/data/iris.tsv, range: 0-6740, partition values: [empty row]
2019-08-08 23:08:07 INFO  CodeGenerator:54 - Code generated in 10.110679 ms
2019-08-08 23:08:07 INFO  PythonRunner:54 - Times: total = 573, boot = 340, init = 230, finish = 3
2019-08-08 23:08:07 INFO  Executor:54 - Finished ta

2019-08-08 23:08:07 INFO  Executor:54 - Running task 0.0 in stage 4.0 (TID 4)
2019-08-08 23:08:07 INFO  FileScanRDD:54 - Reading File path: file:///home/simon/PROJECTS/pybda/data/iris.tsv, range: 0-6740, partition values: [empty row]
2019-08-08 23:08:08 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-08-08 23:08:08 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
2019-08-08 23:08:08 INFO  PythonRunner:54 - Times: total = 227, boot = 3, init = 221, finish = 3
2019-08-08 23:08:08 INFO  Executor:54 - Finished task 0.0 in stage 4.0 (TID 4). 1999 bytes result sent to driver
2019-08-08 23:08:08 INFO  TaskSetManager:54 - Finished task 0.0 in stage 4.0 (TID 4) in 243 ms on localhost (executor driver) (1/1)
2019-08-08 23:08:08 INFO  TaskSchedulerImpl:54 - Removed TaskSet 4.0, whose tasks have all completed, from pool 
2019-08-08 23:08:08 INFO  DAGScheduler:54 - ResultStage 4 (treeAggregate at RowMatrix.

2019-08-08 23:08:19 INFO  BlockManagerInfo:54 - Removed broadcast_5_piece0 on 192.168.1.33:36883 in memory (size: 10.7 KB, free: 366.2 MB)
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 88
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 109
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 176
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 90
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 167
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 52
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 75
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 102
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 170
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 123
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 142
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 128
2019-08-08 23:08:19 INFO 

2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 130
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 99
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 174
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 59
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 115
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 117
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 177
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 50
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 191
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 71
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 184
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 189
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 42
2019-08-08 23:08:19 INFO  ContextCleaner:54 - Cleaned accumulator 116
2019-08-08 23:08:19 INFO 

2019-08-08 23:08:19 INFO  CodeGenerator:54 - Code generated in 5.887212 ms
2019-08-08 23:08:19 INFO  PythonRunner:54 - Times: total = 45, boot = -10859, init = 10901, finish = 3
2019-08-08 23:08:19 INFO  CodeGenerator:54 - Code generated in 22.57845 ms
2019-08-08 23:08:19 INFO  PythonRunner:54 - Times: total = 241, boot = 3, init = 237, finish = 1
2019-08-08 23:08:19 INFO  Executor:54 - Finished task 0.0 in stage 8.0 (TID 8). 2250 bytes result sent to driver
2019-08-08 23:08:19 INFO  TaskSetManager:54 - Finished task 0.0 in stage 8.0 (TID 8) in 466 ms on localhost (executor driver) (1/1)
2019-08-08 23:08:19 INFO  TaskSchedulerImpl:54 - Removed TaskSet 8.0, whose tasks have all completed, from pool 
2019-08-08 23:08:19 INFO  DAGScheduler:54 - ResultStage 8 (take at /home/simon/PROJECTS/pybda/pybda/spark/features.py:181) finished in 0.474 s
2019-08-08 23:08:19 INFO  DAGScheduler:54 - Job 8 finished: take at /home/simon/PROJECTS/pybda/pybda/spark/features.py:181, took 0.476611 s
2019-08-0

2019-08-08 23:08:20 INFO  SparkHadoopMapRedUtil:54 - attempt_20190808230820_0010_m_000000_0: Committed
2019-08-08 23:08:20 INFO  Executor:54 - Finished task 0.0 in stage 10.0 (TID 10). 4019 bytes result sent to driver
2019-08-08 23:08:20 INFO  TaskSetManager:54 - Finished task 0.0 in stage 10.0 (TID 10) in 552 ms on localhost (executor driver) (1/1)
2019-08-08 23:08:20 INFO  TaskSchedulerImpl:54 - Removed TaskSet 10.0, whose tasks have all completed, from pool 
2019-08-08 23:08:20 INFO  DAGScheduler:54 - ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 0.573 s
2019-08-08 23:08:20 INFO  DAGScheduler:54 - Job 10 finished: csv at NativeMethodAccessorImpl.java:0, took 0.578634 s
2019-08-08 23:08:20 INFO  FileFormatWriter:54 - Write Job 60e5f028-c6ae-4bb0-932e-6391807683b7 committed.
2019-08-08 23:08:20 INFO  FileFormatWriter:54 - Finished processing stats for write job 60e5f028-c6ae-4bb0-932e-6391807683b7.
2019-08-08 23:08:21 INFO  FileSourceStrategy:54 - Pruning directo

2019-08-08 23:08:21 INFO  MemoryStore:54 - Block broadcast_23_piece0 stored as bytes in memory (estimated size 3.8 KB, free 361.0 MB)
2019-08-08 23:08:21 INFO  BlockManagerInfo:54 - Added broadcast_23_piece0 in memory on 192.168.1.33:36883 (size: 3.8 KB, free: 366.0 MB)
2019-08-08 23:08:21 INFO  SparkContext:54 - Created broadcast 23 from broadcast at DAGScheduler.scala:1161
2019-08-08 23:08:21 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 13 (MapPartitionsRDD[63] at count at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
2019-08-08 23:08:21 INFO  TaskSchedulerImpl:54 - Adding task set 13.0 with 1 tasks
2019-08-08 23:08:21 INFO  TaskSetManager:54 - Starting task 0.0 in stage 13.0 (TID 13, localhost, executor driver, partition 0, ANY, 7767 bytes)
2019-08-08 23:08:21 INFO  Executor:54 - Running task 0.0 in stage 13.0 (TID 13)
2019-08-08 23:08:21 INFO  ShuffleBlockFetcherIterator:54 - Getting 1 non-empty blocks including 1 local blocks

2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 409
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 418
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 402
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 392
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 410
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 335
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 386
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 310
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 420
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 351
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 338
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 323
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 389
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 384
2019-08-08 23:08:21 

2019-08-08 23:08:21 INFO  BlockManagerInfo:54 - Removed broadcast_2_piece0 on 192.168.1.33:36883 in memory (size: 23.4 KB, free: 366.1 MB)
2019-08-08 23:08:21 INFO  TaskSetManager:54 - Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 8325 bytes)
2019-08-08 23:08:21 INFO  Executor:54 - Running task 0.0 in stage 14.0 (TID 14)
2019-08-08 23:08:21 INFO  FileScanRDD:54 - Reading File path: file:///home/simon/PROJECTS/pybda/data/iris.tsv, range: 0-6740, partition values: [empty row]
2019-08-08 23:08:21 INFO  Executor:54 - Finished task 0.0 in stage 14.0 (TID 14). 2789 bytes result sent to driver
2019-08-08 23:08:21 INFO  TaskSetManager:54 - Finished task 0.0 in stage 14.0 (TID 14) in 14 ms on localhost (executor driver) (1/1)
2019-08-08 23:08:21 INFO  TaskSchedulerImpl:54 - Removed TaskSet 14.0, whose tasks have all completed, from pool 
2019-08-08 23:08:21 INFO  DAGScheduler:54 - ResultStage 14 (take at /home/simon/PROJECTS/pybda/pybda/spark/f

2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 214
2019-08-08 23:08:21 INFO  ContextCleaner:54 - Cleaned accumulator 221
2019-08-08 23:08:21 INFO  MemoryStore:54 - Block broadcast_27 stored as values in memory (estimated size 48.6 KB, free 363.3 MB)
2019-08-08 23:08:21 INFO  MemoryStore:54 - Block broadcast_27_piece0 stored as bytes in memory (estimated size 20.4 KB, free 363.3 MB)
2019-08-08 23:08:21 INFO  BlockManagerInfo:54 - Added broadcast_27_piece0 in memory on 192.168.1.33:36883 (size: 20.4 KB, free: 366.2 MB)
2019-08-08 23:08:21 INFO  SparkContext:54 - Created broadcast 27 from broadcast at DAGScheduler.scala:1161
2019-08-08 23:08:21 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 15 (MapPartitionsRDD[72] at take at /home/simon/PROJECTS/pybda/pybda/spark/features.py:181) (first 15 tasks are for partitions Vector(0))
2019-08-08 23:08:21 INFO  TaskSchedulerImpl:54 - Adding task set 15.0 with 1 tasks
2019-08-08 23:08:21 INFO  TaskSetManager:5

2019-08-08 23:08:22 INFO  Executor:54 - Running task 0.0 in stage 17.0 (TID 17)
2019-08-08 23:08:22 INFO  FileScanRDD:54 - Reading File path: file:///home/simon/PROJECTS/pybda/data/iris.tsv, range: 0-6740, partition values: [empty row]
2019-08-08 23:08:22 INFO  CodeGenerator:54 - Code generated in 5.322336 ms
2019-08-08 23:08:22 INFO  PythonRunner:54 - Times: total = 48, boot = -815, init = 859, finish = 4
2019-08-08 23:08:22 INFO  CodeGenerator:54 - Code generated in 11.984246 ms
2019-08-08 23:08:22 INFO  PythonRunner:54 - Times: total = 54, boot = -374, init = 426, finish = 2
2019-08-08 23:08:22 INFO  PythonUDFRunner:54 - Times: total = 59, boot = -369, init = 420, finish = 8
2019-08-08 23:08:22 INFO  PythonUDFRunner:54 - Times: total = 230, boot = 5, init = 60, finish = 165
2019-08-08 23:08:22 INFO  Executor:54 - Finished task 0.0 in stage 17.0 (TID 17). 11705 bytes result sent to driver
2019-08-08 23:08:22 INFO  TaskSetManager:54 - Finished task 0.0 in stage 17.0 (TID 17) in 273 ms

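
Since the Spark log is long and mostly `INFO` messages, filtering it by level is usually quicker than reading it top to bottom. Below is a minimal sketch, assuming the line layout seen above (timestamp, level, source, message). The sample lines are copied from that output, and `parse_spark_log` is a hypothetical helper. Lines that do not follow the layout, such as truncated ones, are skipped.

```python
import re
from collections import Counter

LINE_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<level>[A-Z]+)\s+(?P<source>\S+) - (?P<message>.*)$"
)

# A few lines copied from the Spark log above; the last one is truncated.
SAMPLE = [
    "2019-08-08 23:08:00 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address",
    "2019-08-08 23:08:01 INFO  SparkContext:54 - Running Spark version 2.4.0",
    "2019-08-08 23:08:08 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS",
    "2019-08-08 23:08:19 INFO ",
]


def parse_spark_log(lines):
    """Yield (level, source, message) for each well-formed log line."""
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\n"))
        if match:
            yield match.group("level"), match.group("source"), match.group("message")


records = list(parse_spark_log(SAMPLE))
level_counts = Counter(level for level, _, _ in records)
problems = [record for record in records if record[0] in ("WARN", "ERROR")]

print(dict(level_counts))
for level, source, message in problems:
    print(f"{level} {source}: {message}")
```

The same function accepts an open file handle, so a real log such as `pca_from_iris-spark.log` can be scanned line by line without loading it into memory at once.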