# Train a Low Rank Matrix Factorization Model
author: Andrew E. Davidson, aedavids@ucsc.edu

Use this notebook to train models with different hyperparameters.

Hyperparameter options:
- numFeatures: the number of features to learn
- filterPercent: removes genes that are missing more than filterPercent of their values
    + Low rank matrix factorization works well with sparse data because it trains only on observed values. However, it is still a good idea to remove genes that are missing a high percentage of values, if for no other reason than that the model will train much faster.
- holdOutPercent
    + enables you to control how much data you train on
    + split() is called a second time to break the holdout set into a validation set and a test set of equal size
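
The two-stage split described above can be sketched as follows. This is a minimal NumPy illustration, not the notebook's own split() implementation: the helper name `splitMask` and the toy mask are hypothetical, and it assumes holdOutPercent is the fraction of observed entries that is held out.

```python
import numpy as np

def splitMask(R, holdOutPercent, rng):
    """Split a 0/1 observation mask into a kept mask and a held-out mask.

    Hypothetical helper: roughly holdOutPercent of the observed entries
    (R == 1) move to the held-out mask; the rest stay in the kept mask.
    """
    observed = np.argwhere(R == 1)                 # (row, col) of each observed value
    numHoldOut = int(round(len(observed) * holdOutPercent))
    chosen = rng.permutation(len(observed))[:numHoldOut]
    holdOut = np.zeros_like(R)
    rows, cols = observed[chosen].T
    holdOut[rows, cols] = 1
    kept = R - holdOut
    return kept, holdOut

rng = np.random.default_rng(42)
# toy mask: 1 = observed, 0 = missing
R = (rng.random((20, 10)) > 0.2).astype(int)

# first split: training set vs. holdout set
RTrain, RHoldOut = splitMask(R, 0.40, rng)
# second split: break the holdout set into validation and test halves
RValidation, RTest = splitMask(RHoldOut, 0.50, rng)

print(R.sum(), RTrain.sum(), RValidation.sum(), RTest.sum())
```

Every observed entry ends up in exactly one of the three masks, so the masks never overlap and sum back to R.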
    
## Output
Trained results, along with clean versions of our gene essentiality data, are written to a subdirectory of the directory that holds the raw data file. The subdirectory is named after the hyperparameters.

- *X*.csv: the low-dimensional gene matrix with shape numGenes x numFeatures
- *Theta*.csv: the low-dimensional cell line matrix with shape numCellLines x numFeatures
- *Y*.csv: a clean, filtered, tidy version of the gene essentiality data
- *R*.csv: a knockout matrix with value 1 for observed values and 0 otherwise
- *RTRAIN*.csv, *RVALIDATE*.csv, *RTEST*.csv: knockout matrices you can use to select data sets from Y
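
To make the shapes concrete, the sketch below shows how the two factor matrices combine to approximate Y and how R restricts the cost to observed values. It assumes Y is laid out as numGenes x numCellLines and uses the usual squared-error objective with an optional L2 penalty. The function name `maskedCost` and the weight `lam` are illustrative and are not taken from the model code behind this notebook.

```python
import numpy as np

def maskedCost(X, Theta, Y, R, lam=0.0):
    """Squared error over observed entries only, plus an optional L2 penalty."""
    error = (X @ Theta.T - Y) * R          # zero out the unobserved entries
    penalty = lam / 2.0 * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return 0.5 * np.sum(error ** 2) + penalty

rng = np.random.default_rng(0)
numGenes, numCellLines, numFeatures = 6, 4, 2
X = rng.normal(size=(numGenes, numFeatures))          # numGenes x numFeatures
Theta = rng.normal(size=(numCellLines, numFeatures))  # numCellLines x numFeatures

Y = X @ Theta.T            # toy "truth" with shape numGenes x numCellLines
R = np.ones_like(Y)
R[0, 0] = 0                # pretend this value was never observed
Y[0, 0] = 999.0            # ignored by the cost because R[0, 0] == 0

cost = maskedCost(X, Theta, Y, R)
print(Y.shape, cost)
```

Note that a real matrix with NaN in the missing positions would need those values replaced (for example with 0) before multiplying by R, because NaN * 0 is still NaN.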

Sample files:
```
$ ls -1 data/n_19_geneFilterPercent_0.25_holdOutPercent_0.4/
D2_Achilles_gene_dep_scores_RTEST_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_RTRAIN_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_RTRUTH_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_RVALIDATE_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_Theta_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_X_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_Y_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_cellLineNames_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
D2_Achilles_gene_dep_scores_geneNames_n_19_geneFilterPercent_0.25_holdOutPercent_0.4.csv
```

This notebook does not evaluate model performance.

In [1]:
from datetime import datetime
import logging
from   setupLogging import setupLogging
configFilePath = setupLogging( default_path='src/test/logging.test.ini.json')
logger = logging.getLogger("notebook")
logger.info("using logging configuration file:{}".format(configFilePath))

import numpy as np
from DEMETER2.lowRankMatrixFactorizationEasyOfUse \
    import LowRankMatrixFactorizationEasyOfUse as LrmfEoU

[INFO <ipython-input-1-dae33e64cf83>:6 - <module>()] using logging configuration file:src/test/logging.test.ini.json


## filterPercent Hyperparameter Tuning
Based on explore.ipynb, setting filterPercent = 0.01 removes 5604 genes, which might suggest that the model only works on, or can only impute missing values in, a dense matrix. Setting filterPercent = 0.25 removes 5562 genes and suggests that the model works with a sparse matrix. Be careful about the "optics" of setting this hyperparameter. In reality, I do not think it makes a difference: the two settings differ by only 42 genes.
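
A quick way to see how a threshold translates into a gene count is to compute the fraction of missing values per gene. The sketch below uses a toy matrix with genes in rows and NaN for missing values. Both that layout and the helper name `genesToKeep` are assumptions, and the counts it prints are unrelated to the numbers reported from explore.ipynb.

```python
import numpy as np

def genesToKeep(Y, filterPercent):
    """Boolean row selector: True for genes missing at most filterPercent of values."""
    missingFraction = np.isnan(Y).mean(axis=1)
    return missingFraction <= filterPercent

Y = np.array([
    [0.1,    0.2,    0.3,    0.4],   # 0% missing
    [0.1,    np.nan, 0.3,    0.4],   # 25% missing
    [np.nan, np.nan, 0.3,    0.4],   # 50% missing
    [np.nan, np.nan, np.nan, 0.4],   # 75% missing
])

for filterPercent in (0.01, 0.25, 0.5):
    keep = genesToKeep(Y, filterPercent)
    print(filterPercent, "removes", int((~keep).sum()), "genes")

# apply the filter: keep only the rows that pass the threshold
YFiltered = Y[genesToKeep(Y, 0.25)]
```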

In [2]:
np.random.seed(42)

In [3]:
dataDir = "data/"
dataFileName = "D2_Achilles_gene_dep_scores.tsv"
numFeatures = 19
geneFilterPercent = 0.25 
holdOutPercent = 0.40 

#randomize=True
#easyOfUse = LrmfEoU(dataDir, dataFileName, numFeatures, geneFilterPercent, holdOutPercent, randomize, tag="_randomized")
easyOfUse = LrmfEoU(dataDir, dataFileName, numFeatures, geneFilterPercent, holdOutPercent)

In [4]:
start = datetime.now()
dateTimeFmt = "%H:%M:%S"
startTime = start.strftime(dateTimeFmt)

%time resultsDict = easyOfUse.runTrainingPipeLine()

end = datetime.now()
endTime = end.strftime(dateTimeFmt)

duration = end - start
durationTime = str(duration)

fmt = "data is up: numFeatures:{} holdOut:{} run time: {}"
msg = fmt.format(numFeatures, holdOutPercent, durationTime)
cmdFmt = 'tell application \\"Messages\\" to send \\"{}\\" to buddy \\"Andy Davidson\\"'
cmd = cmdFmt.format(msg)
print(cmd)

# apple iMessage
#! osascript -e 'tell application "Messages" to send "data is up " to buddy "Andy Davidson"'
! osascript -e "$cmd"

[INFO lowRankMatrixFactorizationEasyOfUse.py:96 - runTrainingPipeLine()] begin load and clean
[INFO lowRankMatrixFactorizationEasyOfUse.py:98 - runTrainingPipeLine()] load and clean completed
[INFO lowRankMatrixFactorizationEasyOfUse.py:103 - runTrainingPipeLine()] begin data set split
[INFO lowRankMatrixFactorizationEasyOfUse.py:107 - runTrainingPipeLine()] end data set split
[INFO lowRankMatrixFactorizationEasyOfUse.py:109 - runTrainingPipeLine()] begin model fit
         Current function value: 25905.072274
         Iterations: 186
         Function evaluations: 399
         Gradient evaluations: 387
[INFO lowRankMatrixFactorizationEasyOfUse.py:111 - runTrainingPipeLine()] end model.fit
CPU times: user 22min 45s, sys: 16.3 s, total: 23min 1s
Wall time: 5min 35s
tell application \"Messages\" to send \"data is up: numFeatures:19 holdOut:0.4 run time: 0:05:35.844662\" to buddy \"Andy Davidson\"


In [5]:
storageLocation = easyOfUse.saveAll(resultsDict)
print("model and tidy data sets have been stored to {}".format(storageLocation))

model and tidy data sets have been stored to data/n_19_geneFilterPercent_0.25_holdOutPercent_0.4


In [6]:
# clean tidy version of demeter data
Y, R, cellLines, geneNames = resultsDict["DEMETER2"]

# trained model
# scipy.optimize.OptimizeResult
X, Theta, optimizeResult = resultsDict["LowRankMatrixFactorizationModel"]

# knockout logical filters. Use them to select the train, validation, and test values from Y
RTrain, RValidation, RTest = resultsDict["filters"]
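
Assuming the filters are 0/1 matrices with the same shape as Y, they can be turned into boolean selectors. A minimal sketch on made-up arrays that stand in for the values unpacked above:

```python
import numpy as np

# toy stand-ins; the real arrays come from resultsDict
Y = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
RTrain      = np.array([[1, 0, 1], [0, 1, 0]])
RValidation = np.array([[0, 1, 0], [0, 0, 0]])
RTest       = np.array([[0, 0, 0], [1, 0, 1]])

# boolean indexing returns the selected values as a flat, row-major array
yTrain = Y[RTrain == 1]
yValidation = Y[RValidation == 1]
yTest = Y[RTest == 1]

print(yTrain, yValidation, yTest)
```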

In [7]:
easyOfUse.dipslayOptimizedResults(optimizeResult)

success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 25905.072273820377
number of iterations: 186
 Number of evaluations of the objective functions : 399
 Number of evaluations of the objective functions and of its gradient : 387
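
The fields printed above correspond to attributes of `scipy.optimize.OptimizeResult`. The sketch below reads the same attributes from a toy minimization. The optimizer method behind this notebook's model is not shown here, so `method="CG"` and the Rosenbrock test function are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# toy problem: minimize the Rosenbrock function with an analytic gradient
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="CG")

print("success:", result.success)
print("status:", result.status, "message:", result.message)
print("final cost:", result.fun)
print("number of iterations:", result.nit)
print("objective evaluations:", result.nfev)
print("gradient evaluations:", result.njev)
```

A status of 2 with the "precision loss" message means the optimizer stopped because it could no longer make progress, not because it reached its gradient tolerance, so `success:False` by itself does not say whether the learned factors are usable.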


# Results
## numFeatures_100_n_100_geneFilterPercent_0.25_holdOutPercent_0.4
```
start Time:09:35:51
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 16876.330748
         Iterations: 316
         Function evaluations: 675
         Gradient evaluations: 663
CPU times: user 2h 50min 21s, sys: 7min 50s, total: 2h 58min 12s
Wall time: 22min 20s
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 16876.330748218537
number of iterations: 316
 Number of evaluations of the objective functions : 675
 Number of evaluations of the objective functions and of its gradient : 663
end Time:09:58:11
duration:0:22:20.553166
```

## numFeatures_50_n_50_geneFilterPercent_0.25_holdOutPercent_0.4

```
start Time:07:11:42
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 21294.349868
         Iterations: 202
         Function evaluations: 443
         Gradient evaluations: 432
CPU times: user 56min 3s, sys: 1min 23s, total: 57min 27s
Wall time: 7min 11s
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 21294.34986791021
number of iterations: 202
 Number of evaluations of the objective functions : 443
 Number of evaluations of the objective functions and of its gradient : 432
end Time:07:18:53
duration:0:07:11.213200
```

## numFeatures_25_n_25_geneFilterPercent_0.25_holdOutPercent_0.4
```
start Time:08:38:39
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 24845.809920
         Iterations: 164
         Function evaluations: 424
         Gradient evaluations: 414
CPU times: user 22min 33s, sys: 11 s, total: 22min 44s
Wall time: 4min 8s
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 24845.809919987216
number of iterations: 164
 Number of evaluations of the objective functions : 424
 Number of evaluations of the objective functions and of its gradient : 414
end Time:08:42:48
duration:0:04:08.631190
```

## numFeatures_19_n_19_geneFilterPercent_0.25_holdOutPercent_0.4
```
start Time:11:26:43
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 25905.072274
         Iterations: 186
         Function evaluations: 399
         Gradient evaluations: 387
CPU times: user 21min 30s, sys: 21.7 s, total: 21min 52s
Wall time: 3min 54s
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 25905.072273820377
number of iterations: 186
 Number of evaluations of the objective functions : 399
 Number of evaluations of the objective functions and of its gradient : 387
end Time:11:30:37
duration:0:03:54.615135
```

## n_19_geneFilterPercent_0.25_holdOutPercent_0.4_randomized
```
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 31563.366168
         Iterations: 143
         Function evaluations: 343
         Gradient evaluations: 332
[INFO lowRankMatrixFactorizationEasyOfUse.py:108 - runTrainingPipeLine()] end model.fit
CPU times: user 19min 58s, sys: 18.1 s, total: 20min 16s
Wall time: 4min 58s
```

## numFeatures_14_n_14_geneFilterPercent_0.25_holdOutPercent_0.4
```
start Time:11:16:54
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 27115.341436
         Iterations: 150
         Function evaluations: 342
         Gradient evaluations: 330
CPU times: user 16min 30s, sys: 8.69 s, total: 16min 39s
Wall time: 3min
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 27115.34143592632
number of iterations: 150
 Number of evaluations of the objective functions : 342
 Number of evaluations of the objective functions and of its gradient : 330
end Time:11:19:55
duration:0:03:00.456812
```

## numFeatures = 3 holdOutPercent = 0.1 filterPercent = 0.25
```
start Time:19:35:15
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 31109.828038
         Iterations: 127
         Function evaluations: 295
         Gradient evaluations: 283
CPU times: user 13min 38s, sys: 5.05 s, total: 13min 43s
Wall time: 2min 29s
success:False
status:2 message:Desired error not necessarily achieved due to precision loss.
final cost: 31109.82803776644
number of iterations: 127
 Number of evaluations of the objective functions : 295
 Number of evaluations of the objective functions and of its gradient : 283
end Time:19:37:45
duration:0:02:29.423916
```
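
For a side-by-side view, the runs logged above can be collected into a small table. The numbers below are copied from those logs. The randomized run is left out because it is not a like-for-like configuration, and the numFeatures = 3 run is excluded from the comparison because it used a different holdOutPercent. Keep in mind that these are final training costs and, as noted earlier, this notebook does not evaluate model performance.

```python
# (numFeatures, holdOutPercent, final cost, wall time in seconds)
runs = [
    (100, 0.4, 16876.330748, 22 * 60 + 20),
    (50,  0.4, 21294.349868, 7 * 60 + 11),
    (25,  0.4, 24845.809920, 4 * 60 + 8),
    (19,  0.4, 25905.072274, 3 * 60 + 54),
    (14,  0.4, 27115.341436, 3 * 60 + 0),
    (3,   0.1, 31109.828038, 2 * 60 + 29),
]

# compare only the runs that share holdOutPercent = 0.4
comparable = sorted((r for r in runs if r[1] == 0.4), key=lambda r: r[0])

for numFeatures, holdOut, cost, seconds in comparable:
    print(f"numFeatures={numFeatures:>3}  final cost={cost:>10.2f}  "
          f"wall time={seconds // 60}m {seconds % 60:02d}s")
```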