# Solvers.io project to predict gene expression from motif combinations in promoters
`example_data/example_500rows.csv` is an example of the kind of data we can generate.
- There is an initial column, `AGI`, which contains the gene identifier. This is for information only; it can be discarded for the analysis.
- A final label column, `Value`, is dummy encoded*: either 1 or -1. 1 means the gene was expressed in a particular cell type, while -1 means it was not expressed.
- All the remaining columns are features (transcription factor binding motifs) that exist in the promoter of one or more genes. These are binary: 1 indicates the feature was present, 0 indicates it was absent.
* note: we can also provide scalar values rather than dummy encoding.
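
The column layout described above could be read along these lines. This is only a minimal sketch using the Python standard library: the gene identifiers, motif column names, and values in `SAMPLE_CSV` are made up for illustration, and `load_dataset` is a hypothetical helper, not part of this repository.

```python
import csv
import io

# Tiny made-up sample in the documented layout: AGI first, Value last,
# binary motif columns in between. All names and values are hypothetical.
SAMPLE_CSV = """AGI,motif_A,motif_B,motif_C,Value
GENE_0001,1,0,1,1
GENE_0002,0,0,1,-1
GENE_0003,1,1,0,1
"""


def load_dataset(handle):
    """Split rows into gene ids, binary feature rows, and 1/-1 labels."""
    reader = csv.DictReader(handle)
    # Every column other than AGI and Value is treated as a motif feature.
    feature_names = [c for c in reader.fieldnames if c not in ("AGI", "Value")]
    gene_ids, features, labels = [], [], []
    for row in reader:
        gene_ids.append(row["AGI"])  # informational only, not used for modelling
        features.append([int(row[name]) for name in feature_names])
        labels.append(int(row["Value"]))  # assumes the 1 / -1 encoding
    return feature_names, gene_ids, features, labels


feature_names, gene_ids, X, y = load_dataset(io.StringIO(SAMPLE_CSV))
```

To try this on the real example file, pass `open("example_data/example_500rows.csv", newline="")` instead of the in-memory sample. If scalar values are supplied rather than the 1 / -1 encoding, the `int(...)` conversion of `Value` would need to become `float(...)`.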
## Full datasets
One dataset is provided for now. We can generate many such datasets if needed.
We want to be able to:
- Predict whether a gene will be expressed in a particular condition given its promoter sequence
- Find out exactly which combinations of motifs are important in the predictions
or to rephrase without the biology:
- Predict the `Value` given the feature columns.
- Identify which features are important in the predictions.
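
As a rough starting point for the second goal, one naive baseline is to compare how often genes carrying a given motif combination are expressed against the overall expression rate. The sketch below is an assumption about how one might begin, not a proposed solution: `score_combinations`, its `max_size` and `min_support` parameters, and the toy data are all hypothetical, and the score is a simple difference in rates rather than a statistical test or a trained model.

```python
from itertools import combinations


def score_combinations(feature_names, X, y, max_size=2, min_support=2):
    """Rank motif combinations by how far their expression rate is from the baseline."""
    baseline = sum(1 for label in y if label == 1) / len(y)
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(feature_names)), size):
            # Labels of genes whose promoter contains every motif in the combination.
            hits = [label for row, label in zip(X, y)
                    if all(row[i] == 1 for i in combo)]
            if len(hits) < min_support:
                continue  # too few genes to say anything about this combination
            rate = sum(1 for label in hits if label == 1) / len(hits)
            names = tuple(feature_names[i] for i in combo)
            results.append((names, len(hits), rate - baseline))
    # Largest absolute departure from the baseline first.
    results.sort(key=lambda r: abs(r[2]), reverse=True)
    return results


# Made-up toy data: genes with both motif_A and motif_B are always expressed.
names = ["motif_A", "motif_B", "motif_C"]
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
y = [1, 1, -1, -1, -1, -1]

ranked = score_combinations(names, X, y)
for combo, support, lift in ranked:
    print(combo, support, round(lift, 3))
```

The number of combinations grows quickly with `max_size` and the number of motif columns, so exhaustive enumeration like this is only practical for small combination sizes.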