Make the covariate calculation part of RF #193

BauerLab · 2021-02-22T00:07:49Z

Implement covariates in vs hail interface function varspark.hail.random_forest_model with the interface analogous to hail's logistic_regression_rows() methos (see: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows)

Currently random_forest_model takes the covariates argument but it is ignored.

The argument should be a list of expression of type float64 (same as logistic_regression_rows) and these expressions should be evaluated and included as continuous variables for random forest model building along the ordered factor genotypic variables.

Because we want to keep the variant importance output table as is (that is indexed by locus and list of alleles) we cannot include the importances of covariates in it. Instead we should a method covariates_importance() on RandomForestModel that returns a table with the following schema:

'covariate': str - the name of the covariate (as in the input expression)
'importance':float64 - the importance
'splitCount':int64 - the split count

indexed with the 'covariate'.

So for example we can have a phenotype file in csv format like this (this is the modified hipster_lableles.txt):

samples,score,label, age, PC0,PC1, PC2
HG00096,6,0,64,0.332,0.444,0.45
HG00097,15.5,1,32,0.444,0.334,0.128 
HG00099,8,0,17,0.983,0.344,0.4343
...

Then the code with the use of covariates age, PC0 and PC1 should look like this:

    data = hl.import_vcf(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster.vcf.bgz'))
    labels = hl.import_table(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster_labels.txt'),
                             delimiter=',',
                             types=dict(label='float64', score='float64', age='float64', PC0='float64',
                                 PC1='float64',, PC2='float64'
                             )).key_by('samples')

    mt = data.annotate_cols(hipster=labels[data.s])
    print(mt.count())

    with vshl.random_forest_model(y=mt.hipster.label,
                                  x=mt.GT.n_alt_alleles(), 
                                  covariates = [mt.hipster.age, mt.hipster.PC0, mt.hipster.PC1],
                                  seed=13, mtry_fraction=0.05,
                                  min_node_size=5, max_depth=10) as rf_model:
        rf_model.fit_trees(100, 50)

        ...
        rf_model.covriate_importance().show() # display the covariate importance table

Output:

covariate importance splitCount
age         10.0          200
PC0         1.0             4
...

Things to consider:

in the first instance all covariates can be treated as continuous variables but in the future we could try to use more fine-grained mapping of the hail schema type to VariantSpark feature types which would for example allow to include some covariates as OrderedFactor or Factor(not implemented yet) variables.

Implementation notes

The examples in 'python` directory and possibly a notebook example should be created to demonstrate this new functionality.

The python example can be based on the examples/local_run-importance-ch22_with_pheno.sh. The transposed version of the data/chr22_1000_pheno-wide.csv may need to be created to support this (that should also include the classification response variable).

For the notebook datasets a more biologically relevant dataset should be used that possibly includes principal component analysis factors. Maybe hipster index dataset can be adapted for this (maybe with PC factors or some other random covariates that do not need to associated with the response)

The text was updated successfully, but these errors were encountered:

piotrszul · 2022-02-22T01:15:42Z

Hi @lm-fsng what do you think about the spec above. In particular how would you see getting the importances of the covariates back (and is this even necessary)

piotrszul · 2022-02-23T01:19:42Z

Hi @lm-fsng I have included the changes we discussed. Please confirm that you are happy with the current spec.
And also please follow up with Rob on whether the covariates need to be included in p-value caluclation.

lm-fsng · 2022-02-23T03:45:09Z

Hi @piotrszul and @rocreguant , just spoke to Rob about the covariates and the p-value calculation. He said that the continuous covariates would be systematically biased to be more important, which would 'push' the genotypes down on the variable importance scale. This would result in less significant SNPs if we include the covariates in the p-value calculation. The workaround that is to ignore the covariates and just use the variable importance of the SNPs in the p-value method.

BauerLab added the enhancement label Feb 22, 2021

piotrszul added this to To do in Release 0.4 via automation Mar 1, 2021

piotrszul removed this from To do in Release 0.4 Feb 10, 2022

piotrszul added this to To do in Release 0.5 Feb 10, 2022

piotrszul assigned rocreguant Feb 23, 2022

rocreguant moved this from To do to In progress in Release 0.5 Feb 23, 2022

rocreguant mentioned this issue Mar 7, 2022

Covariates implementation and interface #209

Merged

piotrszul closed this as completed in #209 May 11, 2022

Release 0.5 automation moved this from In progress to Done May 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the covariate calculation part of RF #193

Make the covariate calculation part of RF #193

BauerLab commented Feb 22, 2021 •

edited by piotrszul

piotrszul commented Feb 22, 2022

piotrszul commented Feb 23, 2022

lm-fsng commented Feb 23, 2022

Make the covariate calculation part of RF #193

Make the covariate calculation part of RF #193

Comments

BauerLab commented Feb 22, 2021 • edited by piotrszul

piotrszul commented Feb 22, 2022

piotrszul commented Feb 23, 2022

lm-fsng commented Feb 23, 2022

BauerLab commented Feb 22, 2021 •

edited by piotrszul