CoreGx Design Documentation

Christopher Eeles edited this page Jun 3, 2022 · 17 revisions

TreatmentResponseExperiment

Note: Since a TreatmentResponseExperiment is currently just a wrapper around the LongTable class, all of this documentation applies to both objects!

Object Design

TODO::

Object Dimensions

A TreatmentResponseExperiment (TRE) has list-like and table-like behaviors. For table-like behaviors, rows are defined by one or more key columns which uniquely identify each row of the data.table in the rowData slot. These columns are referred to as the rowIDs and are concatenated together with the ':' character to make pseudo-rownames. The same is true for the colData table, with associated colIDs and pseudo-colnames.
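As a rough illustration of how pseudo-rownames could be derived, the sketch below pastes together the rowID columns of a toy rowData table. The column names (treatment1id, treatment1dose) and values are made up for the example, and this is not necessarily how CoreGx builds the names internally.

```r
library(data.table)

# Toy rowData; column names and values are hypothetical
rowData <- data.table(
    rowKey = 1:3,
    treatment1id = c("drugA", "drugA", "drugB"),
    treatment1dose = c(0.1, 1, 0.1)
)
rowIDs <- c("treatment1id", "treatment1dose")

# Concatenate the identifier columns with ":" to form pseudo-rownames
pseudo_rownames <- rowData[, do.call(paste, c(.SD, sep = ":")), .SDcols = rowIDs]
print(pseudo_rownames)
```

The same approach would apply to colData with its colIDs.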

Use of such pseudo-dimnames allows a TRE to be subset analogously to a base data.frame by specifying the dimension names of the "rows" or "columns" of the object. As a result, the [ method exploits the table-like behaviors of the object. In addition to data.frame-like subsets, two additional mechanisms for subsetting have been implemented. Firstly, pseudo-dimnames can be specified using glob or regex patterns, which are matched against the pseudo-dimnames before returning the subset. Secondly, the [ method allows data.table-style subsets using expressions, with the caveat that any expression subset query needs to be wrapped in the .() function to protect the call from early evaluation during S4-method dispatch. These protected expressions are then passed through to the i argument of the rowData or colData data.tables.
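The two extra subsetting mechanisms can be mimicked outside of a TRE. The sketch below is only an approximation under stated assumptions: the pseudo-rownames and column names are hypothetical, and quote() stands in for the .() wrapper to show why an expression has to be captured before it reaches the i argument of a data.table.

```r
library(data.table)

# Hypothetical pseudo-rownames of the form "<treatment>:<dose>"
pseudo_rownames <- c("drugA:0.1", "drugA:1", "drugB:0.1", "otherDrug:1")

# Regex pattern: every row whose treatment identifier is drugA
regex_hits <- grep("^drugA:", pseudo_rownames, value = TRUE)

# Glob pattern: utils::glob2rx() converts the wildcard into a regex first
glob_hits <- grep(glob2rx("*:0.1"), pseudo_rownames, value = TRUE)

# Expression subset: capture the query unevaluated (quote() is a stand-in
# for .() here), then evaluate it inside the table
rowData <- data.table(
    treatment1id = c("drugA", "drugA", "drugB"),
    treatment1dose = c(0.1, 1, 0.1)
)
query <- quote(treatment1id == "drugA" & treatment1dose > 0.5)
subset_rows <- rowData[eval(query)]
```

Without the capture step, R would try to evaluate treatment1id in the calling environment, where no such variable exists.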

The assays slot of a TRE contains the measurements of interest in the object and possesses list-like behaviors. You can access and assign an assay via the $ and [[ methods. However, table-like subsets on the object via [ or subset do the necessary internal work to subset each item in the assays list as well.

Assay Index

The assay index table was introduced to allow aggregation operations over rowKey and colKey values to be stored inside a TreatmentResponseExperiment. Previously assays were keyed directly by the values of rowKey and colKey and thus no assay could store a summary over the rowID or colID columns. This effectively made it impossible to store interesting aggregations, for example summaries over dose or replicates, inside a TreatmentResponseExperiment object.

To resolve this issue, two additional pieces of structural metadata have been added to the .intern slot. The assayIndex is a table which maps from rowKey and colKey combinations to an integer key for each assay table. The assayKeys are a list of rowIDs and colIDs which are required to uniquely identify a measurement in an assay. The assayKeys are used to define an integer assay key column in each assay data.table. This prevents unnecessary repetition of character metadata columns inside the assays of a TRE and acts as a form of compression vs storing the data in a single, long-format data.table. Initial tests indicate about a 50% reduction in object size vs the long-format data.table, which will increase with the number of rowData and colData columns, but decrease slightly with the number of assays in a TRE.
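A toy version of this layout may help make the relationships concrete. Everything in the sketch below (table contents, the sensitivity assay name, and the name of its key column) is hypothetical and only meant to show how an integer assay key links an assay to rowData and colData through the index.

```r
library(data.table)

# Toy dimension tables; identifier columns are hypothetical
rowData <- data.table(
    rowKey = 1:4,
    treatment1id = c("drugA", "drugA", "drugB", "drugB"),
    treatment1dose = c(0.1, 1, 0.1, 1)
)
colData <- data.table(colKey = 1:2, sampleid = c("cellLine1", "cellLine2"))

# assayIndex: one row per rowKey/colKey combination plus one integer
# key column per assay (here a single assay called "sensitivity")
assayIndex <- CJ(rowKey = rowData$rowKey, colKey = colData$colKey)
assayIndex[, sensitivity := .I]

# The assay itself stores only its integer key and the measured values
sensitivity_assay <- data.table(
    sensitivity = 1:8,
    viability = seq(100, 30, by = -10)
)

# Expanding back to long format is a chain of joins through the index
long <- merge(assayIndex, sensitivity_assay, by = "sensitivity")
long <- merge(long, rowData, by = "rowKey")
long <- merge(long, colData, by = "colKey")
```

The character identifier columns appear once in rowData and colData rather than once per measurement, which is where the size reduction relative to a single long-format table comes from.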

Summaries inside of a specific assay can be stored by repeating the value of the associated assayKey in the corresponding column of the assayIndex. This ensures that the data which has been aggregated over can still be retrieved while also allowing storage of summaries over some subset of rowKey and colKey values. For now, the assayIndex will contain a column for each assay in the TRE, even if an assay is "parallel" to other assays (i.e., keyed by the same columns). While this does slightly increase the size of the object due to storing repeated information, it greatly simplifies the logic required for subsets, as well as for assigning new assays or computing summaries over an existing assay. The cost of this is on the order of 3.3 MB per million assay rows per assay, which we currently feel is justified by the simplification of the TRE logic.
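The repeated-key idea can be sketched as follows. The assay names (raw, mean_viability), the key values, and the measurements are all invented for illustration; the point is only that several rowKey/colKey combinations may share one key in the column of a summarized assay.

```r
library(data.table)

# Index for one sample (colKey 1) and two treatments at two doses each;
# rowKeys 1-2 are assumed to be drugA, rowKeys 3-4 drugB (hypothetical)
assayIndex <- data.table(
    rowKey = 1:4,
    colKey = 1L,
    raw = 1:4                             # one raw measurement per combination
)

# Summary over dose: both doses of a treatment share one summary key
assayIndex[, mean_viability := c(1L, 1L, 2L, 2L)]

raw_assay <- data.table(raw = 1:4, viability = c(90, 50, 80, 20))
mean_assay <- data.table(mean_viability = 1:2, viability = c(70, 50))

# Each raw observation can still be linked to the summary it contributed to
linked <- merge(assayIndex, raw_assay, by = "raw")
linked <- merge(linked, mean_assay, by = "mean_viability",
    suffixes = c("_raw", "_mean"))
```

Because the raw keys remain in the index, the values that were aggregated over are not lost when the summary is stored.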

Maintaining Order

To prevent the assayIndex from becoming convoluted, we have implemented the reindex method. This method takes in a TRE object and updates the rowKey, colKey and assayKeys such that they are the smallest possible set of consecutive integers. To maintain referential integrity, these keys need to be updated both in the assayIndex as well as in each slot of the object. To make comparison of objects after reindexing simple, a default ordering needs to be implemented for each of the internal slots such that reordering the assayIndex data.table will not result in different results from the TRE accessor methods.

The default ordering in various conditions is outlined below:

| Slot       | Condition                                  | Order          |
|------------|--------------------------------------------|----------------|
| rowData    | internal                                   | rowKey         |
| colData    | internal                                   | colKey         |
| assays     | internal                                   | assayKey       |
| rowData    | accessed; withDimnames=TRUE                | rowIDs         |
| rowData    | accessed; withDimnames=FALSE               | rowKey         |
| colData    | accessed                                   | colIDs         |
| assays     | accessed; withDimnames=TRUE                | rowIDs, colIDs |
| assays     | accessed; withDimnames=FALSE               | rowKey, colKey |
| assays     | accessed; withDimnames=FALSE & key=FALSE   | assayKey       |
| assayIndex | always                                     | assayKeys      |

Note: work is currently underway to make sure that the keys are always ordered, preventing the need for sorting. Returns from TRE accessor methods will not be keyed by default to ensure working with a TRE is as fast as possible. This should resolve the issues outlined below. Question: if the tables are always ordered, should setting keys have negligible cost?

Potential issues: sorting is a (relatively) expensive operation. While data.table uses radix sort and is very efficient, sorting still has the potential to slow down accessors. To avoid these sorts, you can use the secret argument raw=TRUE, which can be passed to the rowData, colData, assay and assays accessor methods to short circuit and return the result of @<slotName>. If you do this, make sure your method applies the appropriate sort before the final return statement. If your function may be used inside of other CoreGx accessors, also make sure to add code for the raw=TRUE secret argument to ensure wasteful sorting can be avoided.

IMPLEMENTING SECRET [BOOLEAN] ARGUMENTS: (replace angle brackets with actual code!)

if (any(...names() == "<secret_arg>") && isTRUE(...elt(which(...names() == "<secret_arg>"))))
        <do something>
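
A filled-in version of the template above might look like the following sketch. The get_table function and its id column are hypothetical and not part of CoreGx; the example only shows the pattern of checking for a named argument hidden in the dots. It assumes R >= 4.1.0, where ...names() is available.

```r
# Hypothetical accessor illustrating the secret raw=TRUE pattern
get_table <- function(x, ...) {
    if (any(...names() == "raw") && isTRUE(...elt(which(...names() == "raw"))))
        return(x)                      # short circuit: skip the sort
    x[order(x$id), , drop = FALSE]     # default path: return sorted rows
}

d <- data.frame(id = c(3, 1, 2))
sorted_ids <- get_table(d)$id              # sorted by id
raw_ids <- get_table(d, raw = TRUE)$id     # original order, no sort
```

Because the argument never appears in the function signature, it stays out of the documented interface while remaining available to internal callers.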

Column orders

To make comparisons between TREs simpler, we will enforce a default column ordering. For each slot in a TRE:

  • key columns will come before identifier columns which will come before metadata columns
  • within each class of columns, row keys/identifiers will come before column keys/identifiers
  • metadata columns will be next, and sorted lexicographically within their respective dimension; this will ensure that the same TRE object is created even if the associated DataMapper has the slot maps specified in different orders (NOT YET IMPLEMENTED!)
  • assay columns will remain in the order they were specified in the assayIDs in the constructor

Re-index Algorithm

  1. Sort rowData and colData by their associated id columns and update their keys in a new column prepended with ".".
  2. Join with the assayIndex on the old rowKey and colKey and update by reference to the new keys, if the keys have changed.
  3. Delete the .rowKey and .colKey columns from rowData and colData, respectively.
  4. Sort assayIndex by rowKey and colKey (and therefore also by rowIDs and colIDs) and update assayKey into new columns prepended with ".".
  5. Check which assayKeys have changed and join with the assayIndex to update the key in each assay by reference.
  6. Delete the .assayKey columns from assayIndex.
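
The steps above can be sketched on toy tables as follows. This is a simplified illustration, not the reindex method itself: only the row dimension and a single assay are shown, the table contents are invented, and match() is used as a stand-in for the joins described in steps 2 and 5.

```r
library(data.table)

# Toy tables after a subset: keys are no longer consecutive (hypothetical)
rowData <- data.table(
    rowKey = c(7L, 2L, 5L),
    treatment1id = c("drugC", "drugA", "drugB")
)
assayIndex <- data.table(
    rowKey = c(2L, 5L, 7L), colKey = 1L, assayKey = c(10L, 20L, 30L)
)
assay <- data.table(assayKey = c(10L, 20L, 30L), viability = c(0.9, 0.5, 0.2))

# Step 1: sort by the id column, store the new key in a "."-prefixed column
setorderv(rowData, "treatment1id")
rowData[, .rowKey := .I]

# Step 2: map old rowKey values in the index to the new ones, by reference
assayIndex[, rowKey := rowData$.rowKey[match(rowKey, rowData$rowKey)]]

# Step 3: promote the new key and drop the temporary column
rowData[, rowKey := .rowKey][, .rowKey := NULL]

# Step 4: sort the index and compute new consecutive assay keys;
# match() against unique() keeps repeated (summary) keys equal
setorderv(assayIndex, c("rowKey", "colKey"))
assayIndex[, .assayKey := match(assayKey, unique(assayKey))]

# Step 5: update the key inside the assay through the index
assay[, assayKey := assayIndex$.assayKey[match(assayKey, assayIndex$assayKey)]]

# Step 6: promote the new assay key and drop the temporary column
assayIndex[, assayKey := .assayKey][, .assayKey := NULL]
```

After these steps every key column holds the integers 1 to 3, and the index, rowData and the assay still refer to the same observations.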

Subset Algorithms

Introduction of the assay index to a TRE has implications for the way subset operations will work. This section will define the requisite operations to subset along different TRE dimensions.

Potential Issues

Due to a fix in #148, the subset method now has the potentially unintuitive behavior of dropping additional rowKey or colKey values which have no observations in any assay. This change was required to remove rowKey and colKey values from the assay index which didn't have at least one observation in an assay, which were breaking assay unit tests.
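
The behavior can be reproduced on toy tables. The sketch below is an assumption-laden illustration (invented identifiers, a single assay, no actual call to the subset method) of why a column subset can also remove rows from rowData.

```r
library(data.table)

rowData <- data.table(rowKey = 1:3, treatment1id = c("drugA", "drugB", "drugC"))
colData <- data.table(colKey = 1:2, sampleid = c("cellLine1", "cellLine2"))

# drugC (rowKey 3) was only measured in cellLine2 (colKey 2)
assayIndex <- data.table(
    rowKey = c(1L, 2L, 3L, 1L),
    colKey = c(1L, 1L, 2L, 2L),
    assayKey = 1:4
)

# Subset along the column dimension to cellLine1 only
keep_cols <- colData[sampleid == "cellLine1", colKey]
assayIndex_sub <- assayIndex[colKey %in% keep_cols]

# rowKeys with no remaining observations are dropped as well,
# so drugC disappears even though no row subset was requested
rowData_sub <- rowData[rowKey %in% assayIndex_sub$rowKey]
```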

Questions:

  • Should we throw a warning when additional keys get dropped?
  • Is automatically dropping keys without assay observations desirable? Would it be worth updating the object design to remove this requirement?

rowData/colData

Subsetting by one of the table-like dimensions

assays

Aggregations

A key feature of the TRE object is the ability to apply arbitrary R functions over the object. Since the assays in a TRE are data.tables, we already have a powerful aggregation engine available to us. However, since the data.table internal optimizations only parallelize simple mathematical/statistical functions such as mean, sd, etc. (see data.table::gforce for the full list), operations like curve fitting would take an unreasonably long time if implemented with pure data.table. As a result, we have implemented the aggregate2 method, which is equivalent to data.table aggregation with the option to parallelize computations using BiocParallel. This helper is then used inside the aggregate,TreatmentResponseExperiment-method to allow aggregations over the S4 class with the additional specification of an assay to summarize.
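
To illustrate the general idea without relying on the aggregate2 signature, the sketch below compares a plain data.table aggregation with a chunked version of the same computation. The fit_slope helper, the column names, and the data are hypothetical; the chunked form uses lapply, which is where a parallel apply such as BiocParallel::bplapply could be substituted.

```r
library(data.table)

# Toy assay in long format; names and values are hypothetical
assay <- data.table(
    treatment1id = rep(c("drugA", "drugB"), each = 4),
    dose = rep(c(0.01, 0.1, 1, 10), times = 2),
    viability = c(0.95, 0.8, 0.4, 0.1, 0.99, 0.9, 0.7, 0.3)
)

# A summary that is not GForce-optimized: slope of viability on log10(dose)
fit_slope <- function(dose, viability) {
    unname(coef(lm(viability ~ log10(dose)))[2])
}

# Plain data.table aggregation: groups are evaluated one after another
by_dt <- assay[, .(slope = fit_slope(dose, viability)), by = treatment1id]

# Same computation on per-group chunks; each chunk is independent,
# so the lapply call is the natural place to parallelize
chunks <- split(assay, by = "treatment1id")
by_chunk <- rbindlist(lapply(chunks, function(d) {
    d[, .(slope = fit_slope(dose, viability)), by = treatment1id]
}))
```

Both approaches return one slope per treatment; the chunked form trades some overhead for the ability to distribute expensive per-group work.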

An issue with aggregation which, to our knowledge, has not been solved in other S4-classes which implement this feature is invalidation of summarized values after a subset. For example, if we summarize over dose to compute an average dose and viability