# Combining multiple tasks at once

Often we want to combine several methods in a single analysis. This notebook shows how: we run dimension reduction, clustering, and regression together from **one** simple config file.

We start by loading our designated `pybda` environment:

In [1]:
source ~/miniconda3/bin/activate pybda

(pybda) 


To run combinations of methods and models, we simply need to list them all in the same config file. We deposited one in the `data` folder of `pybda`:

In [2]:
cd data

(pybda) 


In [4]:
cat pybda-usecase-dimred+clustering+regression.config

spark: spark-submit
infile: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
dimension_reduction: pca, ica
n_components: 5
clustering: kmeans, gmm
n_centers: 50, 100
regression: forest, glm
response: is_infected
family: binomial
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
(pybda) 


The config file above does the following:

* fit a PCA and an ICA to `single_cell_imaging_data.tsv` using 5 components each,
* on each of the two results (PCA and ICA), run a $k$-means and a GMM clustering, each with 50 and with 100 cluster centers,
* regress the response column `is_infected` on the features listed in `feature_columns.tsv` using a random forest and a GLM,
* treat the response as a `binomial` variable,
* give the Spark driver 1G of memory and the executor 1G of memory,
* write the results to the folder `results`,
* print debug information.

That's all we need to do!
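To make the combinations concrete, here is a minimal sketch that enumerates the clustering outputs implied by the config. The file-name pattern is copied from the rule tree that `pybda` prints when it runs; the helper name `list_clustering_outputs` is made up for illustration and is not part of `pybda`.

```shell
# Enumerate the clustering outputs implied by the config:
# every clustering method is applied to every dimension reduction result.
# `list_clustering_outputs` is a hypothetical helper, not a pybda command.
list_clustering_outputs() {
  infile="single_cell_imaging_data"
  for dimred in pca ica; do
    for clustering in kmeans gmm; do
      # Naming pattern as it appears in pybda's printed rule tree
      printf '%s_from_%s_from_%s.tsv\n' "$clustering" "$dimred" "$infile"
    done
  done
}

list_clustering_outputs
```

Two dimension reductions times two clustering methods give four clustering results; the two regressions run directly on the input file.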

We then call `pybda` from the command line. Usually we would call `pybda` with a specific target (i.e., *clustering*, *dimension-reduction*, or *regression*) so that not everything is run. In this case, however, we **want** to execute everything, so we call it with *run*.
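The difference between a single target and *run* can be sketched as follows. This assumes that a target name simply takes the place of `run` in the command line used in this notebook; please check the `pybda` documentation before relying on it. The helper `pybda_command` is hypothetical and only prints the command, so the sketch can be tried without a Spark setup.

```shell
# Config file from this notebook
CONFIG="pybda-usecase-dimred+clustering+regression.config"

# Hypothetical helper: print (do not execute) the pybda call for a target.
# Assumption: the target replaces `run` in `pybda run <config> local`.
pybda_command() {
  echo pybda "$1" "$CONFIG" local
}

pybda_command clustering   # only the clustering target
pybda_command run          # everything listed in the config
```

Dropping the `echo` would execute the command instead of printing it.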

In [5]:
pybda run pybda-usecase-dimred+clustering+regression.config local

Checking command line arguments for method: regression
Checking command line arguments for method: dimension_reduction
Checking command line arguments for method: clustering
Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
	 -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
		 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
		 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
	 -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
		 -> clustering (results/201

Traceback (most recent call last):
  File "/Users/simondi/miniconda3/envs/pybda/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/Users/simondi/miniconda3/envs/pybda/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/Users/simondi/miniconda3/envs/pybda/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/Users/simondi/miniconda3/envs/pybda/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
[Thu Aug 15 17:12:15 2019]
[2019-08-15 17:12:15,078 - INFO - snakemake.logging]: [Thu Aug 15 17:12:15 2019]
Finished job 2.
[2019-08-15 17:12:15,079 - INFO - snakemake.logging]: Finished job 2.
1 of 6 steps (17%) done
[2019-08-15 17:12:15,079 - INFO - snakemake.logging]: 

[2019-08-15 17:13:32,296 - INFO - snakemake.logging]: rule pca:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-plot
    jobid: 1
[32m[0m
[2019-08-15 17:13:32,297 - INFO - snakemake.logging]: 
[1;33m Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
	 -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
		 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
		 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/20

results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.pdf
results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.eps
results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.svg
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.png
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.pdf
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.eps
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.svg
results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K50-clusters
results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K100-clusters
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters
results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
This might be due to filesystem latency. If that is the case, consider to increase the wait time wi
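Once the run has finished, the outputs of one method can be picked out of the results folder by their name prefix. The following is a small sketch; `list_outputs` is a hypothetical helper, not part of `pybda`, and the dated folder name is taken from the log above (it appears to follow the run date, which is an assumption).

```shell
# List everything in a results folder whose name starts with "<method>_from_".
# `list_outputs` is a hypothetical helper, not a pybda command.
list_outputs() {
  dir="$1"
  method="$2"
  for path in "$dir"/"$method"_from_*; do
    # Skip the unexpanded pattern when nothing matches
    if [ -e "$path" ]; then
      printf '%s\n' "$path"
    fi
  done
}

# Folder name as printed in the log above
list_outputs results/2019_08_15 kmeans
```

The same call with `gmm`, `pca`, `ica`, `glm`, or `forest` lists the outputs of the other methods.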
