# Regression

PyBDA supports several methods for regression. Here, we show how random forests and gradient boosting can be used to predict a response variable from a set of covariates. We use a single-cell imaging data set to predict whether or not a cell is infected by a pathogen.

We start by activating our environment:

In [1]:
source ~/miniconda3/bin/activate pybda

(pybda) 


To fit the two models, we can use a config file that is already provided in the `data` folder:

In [2]:
cd data

(pybda) 


In [3]:
cat pybda-usecase-regression.config

spark: spark-submit
infile: single_cell_imaging_data.tsv
predict: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
regression: forest, gbm
family: binomial
response: is_infected
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
(pybda) 


With the config file above, we will do the following:

* fit a random forest and a gradient boosting model,
* regress the `response` column on the features in `feature_columns.tsv`,
* treat the response as a `binomial` variable (`family: binomial`),
* use the fitted models to predict the response for the data set given in `predict`,
* give the Spark driver 1G of memory and the executor 1G of memory,
* write the results to `results`,
* print debug information.

So, a brief file like this is enough!
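
Before launching a run, it can help to verify that the config contains the keys used in this use case. The snippet below is only a sketch: `parse_flat_config` is a hypothetical helper that handles the simple `key: value` and `- item` layout shown above. It is neither a full YAML parser nor part of PyBDA, and `EXPECTED_KEYS` merely lists keys that appear in this example, not an authoritative schema.

```python
# Hypothetical helper: parses only the flat "key: value" / "- item" layout
# of the example config. It is not a general YAML parser and not PyBDA code.
CONFIG_TEXT = """\
spark: spark-submit
infile: single_cell_imaging_data.tsv
predict: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
regression: forest, gbm
family: binomial
response: is_infected
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
"""

# Keys used in this use case (an assumption, not PyBDA's official schema).
EXPECTED_KEYS = {"spark", "infile", "outfolder", "features",
                 "regression", "family", "response"}


def parse_flat_config(text):
    """Return a dict mapping keys to strings, or to lists for '- item' blocks."""
    config, current_key = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- ") and current_key is not None:
            # List item belonging to the most recent key that had no value.
            config[current_key].append(line[2:].strip().strip('"'))
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            config[key] = value
            current_key = None
        else:
            config[key] = []
            current_key = key
    return config


config = parse_flat_config(CONFIG_TEXT)
missing = EXPECTED_KEYS - set(config)
methods = [m.strip() for m in config["regression"].split(",")]

print("missing keys:", sorted(missing))
print("methods:", methods)
print("spark parameters:", config["sparkparams"])
```

To check a config on disk instead, the file contents could be passed to `parse_flat_config` in place of `CONFIG_TEXT`.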

We then call PyBDA like this:

In [4]:
pybda regression pybda-usecase-regression.config local

Checking command line arguments for method: regression
Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_09/gbm_from_single_cell_imaging_data.tsv)
	 -> regression (single_cell_imaging_data.tsv, results/2019_08_09/forest_from_single_cell_imaging_data.tsv)

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	forest
	1	gbm
	2

[2019-08-09 00:15:53,221 - INFO - snakemake.logging]: 
[Fri Aug  9 00:15:53 2019]
[2019-08-09 00:15:53,222 - INFO - snakemake.logging]: [Fri Aug  9 00:15:53 2019]
rule gbm:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_09/gbm_from_single_cell_imaging_data-statistics.tsv
    jobid: 0
[2019-08-09 00:15:53,222 - INFO - snakemake.logging]: rule gbm:
    input: single_cell_imag


That's it! The call automatically executes the jobs defined in the config. Once both jobs have finished, we should check the plots and statistics. Let's see what we got:

In [5]:
cd results
ls -lgG *

(pybda) total 13832
-rw-rw-r-- 1    2909 Aug  9 00:17 forest_from_single_cell_imaging_data.log
-rw-r--r-- 1 5320871 Aug  9 00:17 forest_from_single_cell_imaging_data-predicted.tsv
-rw-rw-r-- 1  406579 Aug  9 00:17 forest_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1     118 Aug  9 00:17 forest_from_single_cell_imaging_data-statistics.tsv
-rw-rw-r-- 1    2903 Aug  9 00:17 gbm_from_single_cell_imaging_data.log
-rw-r--r-- 1 5323224 Aug  9 00:17 gbm_from_single_cell_imaging_data-predicted.tsv
-rw-rw-r-- 1 3084636 Aug  9 00:17 gbm_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1     130 Aug  9 00:17 gbm_from_single_cell_imaging_data-statistics.tsv
(pybda) 


Let's check how well the two methods perform:

In [6]:
cat */gbm_from_single_cell_imaging_data-statistics.tsv

family	response	accuracy	f1	precision	recall
binomial	is_infected	0.9349	0.9348907798833392	0.9351464843746091	0.9349000000000001
(pybda) 


In [7]:
cat */forest_from_single_cell_imaging_data-statistics.tsv

family	response	accuracy	f1	precision	recall
binomial	is_infected	0.8236	0.8231143143597965	0.8271935801788475	0.8236
(pybda) 
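
The two statistics files shown above can also be compared programmatically. The following is a small sketch using only Python's standard library. The file contents are copied from the outputs above and embedded as strings so that the example is self-contained; `read_stats` is a hypothetical helper, not a PyBDA function.

```python
import csv
import io

# Contents of the two statistics files, copied from the outputs above.
GBM_STATS = (
    "family\tresponse\taccuracy\tf1\tprecision\trecall\n"
    "binomial\tis_infected\t0.9349\t0.9348907798833392\t"
    "0.9351464843746091\t0.9349000000000001\n"
)
FOREST_STATS = (
    "family\tresponse\taccuracy\tf1\tprecision\trecall\n"
    "binomial\tis_infected\t0.8236\t0.8231143143597965\t"
    "0.8271935801788475\t0.8236\n"
)
METRICS = ("accuracy", "f1", "precision", "recall")


def read_stats(handle):
    """Read a one-row statistics TSV and return its metrics as floats."""
    row = next(csv.DictReader(handle, delimiter="\t"))
    return {metric: float(row[metric]) for metric in METRICS}


# With real files, pass open(path, newline="") instead of io.StringIO(...).
stats = {
    "gbm": read_stats(io.StringIO(GBM_STATS)),
    "forest": read_stats(io.StringIO(FOREST_STATS)),
}

# Print a side-by-side table with the difference per metric.
print(f"{'metric':<10}{'gbm':>8}{'forest':>8}{'diff':>8}")
for metric in METRICS:
    gbm, forest = stats["gbm"][metric], stats["forest"][metric]
    print(f"{metric:<10}{gbm:>8.4f}{forest:>8.4f}{gbm - forest:>8.4f}")
```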


The GBM performed considerably better than the random forest (accuracy of 0.93 versus 0.82). That is hardly surprising: the data set is very noisy, so sequentially fitting each new learner to the errors of the previous ones should be advantageous.
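
The idea of training on the errors of a learner can be illustrated with a toy example. The following is a minimal, pure-Python sketch of gradient boosting with decision stumps on a made-up one-dimensional data set. It is not the implementation used by PyBDA or Spark: the data, the `fit_stump` helper, the learning rate and the number of rounds are all assumptions chosen for illustration, and it uses a squared-error loss for simplicity, whereas a binomial response would call for a logistic loss.

```python
import math
import random

random.seed(0)

# Made-up noisy 1-D data (an assumption for illustration only).
xs = [i / 20 for i in range(100)]
ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, targets):
    """Find the threshold split that minimizes the squared error of the targets."""
    best = None
    for threshold in xs[1:]:  # xs is sorted, so both sides are never empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mse(predictions):
    return mean([(y - p) ** 2 for y, p in zip(ys, predictions)])


learning_rate, n_rounds = 0.1, 50
predictions = [mean(ys)] * len(xs)  # start from a constant model
initial_mse = mse(predictions)

for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    # Add a damped version of the new stump to the current ensemble.
    predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]

final_mse = mse(predictions)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

Each round fits a weak learner to what the current ensemble still gets wrong, so the training error shrinks step by step. A random forest, in contrast, averages trees that are grown independently of each other.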

PyBDA creates several other files that are worth checking. For instance, we should always look at the log files it wrote:

In [8]:
cat */gbm_from_single_cell_imaging_data.log

[2019-08-09 00:15:55,705 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-09 00:15:57,046 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-09 00:15:57,046 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.app.name, value: gbm.py
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 39021
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565302556519
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.driver.host, value: 192.168.1.33
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-09 00:15:57,047 - 


Furthermore, the Spark `log` file is often worth inspecting when a method fails:

In [9]:
head */gbm_from_single_cell_imaging_data-spark.log

2019-08-09 00:15:54 WARN  Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-09 00:15:54 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-09 00:15:54 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-09 00:15:55 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-09 00:15:55 INFO  SparkContext:54 - Submitted application: gbm.py
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing view acls to: simon
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing modify acls to: simon
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-08-09 00:15:55 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(simon);
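
Since Spark logs can become large, filtering them by log level is often easier than reading them top to bottom. The sketch below uses four of the lines shown above as sample input. The regular expression is inferred from these lines only, `filter_log` is a hypothetical helper, and treating `WARN` and `ERROR` as the levels of interest is an assumption.

```python
import re

# Four lines of the Spark log, copied from the output above.
SAMPLE_LOG = """\
2019-08-09 00:15:54 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-09 00:15:54 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-09 00:15:55 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-09 00:15:55 INFO  SparkContext:54 - Submitted application: gbm.py
"""

# Pattern inferred from the lines above: date, time, level, then the message.
LINE_PATTERN = re.compile(r"^(\S+ \S+) (\w+)\s+(.*)$")


def filter_log(lines, levels=("WARN", "ERROR")):
    """Yield (timestamp, level, message) for lines with one of the given levels."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match.group(2) in levels:
            yield match.groups()


# With a real file, iterate over open(path) instead of SAMPLE_LOG.splitlines().
hits = list(filter_log(SAMPLE_LOG.splitlines()))
for timestamp, level, message in hits:
    print(timestamp, level, message)
```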
