Add a "nonphysical" keyword to Rearrangement and Cell #769

scharch · 2024-02-29T17:24:49Z

Closely related to #201, obviously, but I'm actually thinking more about #317 and efforts to simplify the Clone schema. For #201, all Rearrangements/Cells in a Repertoire would be nonphysical, which is why I suggested a Repertoire-level is_simulated keyword.

However, in the Clone space we have inferred intermediates/ancestors, which I guess would either be part of the same Repertoire as the observed Rearrangements/Cells they are based on, or maybe not part of a Repertoire at all.

Currently, we handle this by siloing them into the Clone schema, either directly in Clone (using fields like v_call, germline_alignment, etc) or by converting them into Node objects (which in turn requires Tree to be an object instead of just a field). That's what's making #317 hard, because we've set up Clone and Node to mimic Rearrangements and now we also want them to be able to mimic Cells.

If we instead store the inferred intermediates/ancestors as bona fide but nonphysical Rearrangements/Cells, then Clone can just have a generic array of members and the problem goes away. So crazy it just might work?
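To make the proposal concrete, here is a minimal sketch of what a generic members array could look like. It is not the AIRR schema: the `nonphysical` flag is the keyword proposed in this issue, and the `members` field, the `observed()`/`inferred()` helpers, and the example sequences are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rearrangement:
    sequence_id: str
    sequence: str
    v_call: Optional[str] = None
    # Hypothetical keyword from this issue: True for inferred
    # intermediates/ancestors that were never physically observed.
    nonphysical: bool = False


@dataclass
class Clone:
    clone_id: str
    # Hypothetical generic member list: observed and inferred records
    # share one type, so Clone no longer needs to mimic Rearrangement.
    members: List[Rearrangement] = field(default_factory=list)

    def observed(self) -> List[Rearrangement]:
        return [m for m in self.members if not m.nonphysical]

    def inferred(self) -> List[Rearrangement]:
        return [m for m in self.members if m.nonphysical]


clone = Clone("clone_1", [
    Rearrangement("seq_1", "CAGGTG", v_call="IGHV1-2*02"),
    Rearrangement("seq_2", "CAGGTC", v_call="IGHV1-2*02"),
    Rearrangement("anc_1", "CAGGTN", v_call="IGHV1-2*02", nonphysical=True),
])
```

The same pattern would presumably apply to Cell objects, with the flag distinguishing observed cells from inferred ones.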


schristley · 2024-03-04T03:30:33Z

Ah, okay, I understand better what you are saying now. Creating inferred Rearrangements/Cells would be a nice way to re-use the schema. Yes, so crazy it might work! However, it creates the situation where a fake Repertoire needs to be created to hold it all together. But even that is not right: presumably you do have a real Repertoire for the data, but while doing DataProcessing you are creating inferred Rearrangements, and you don't want them to be accidentally included in other computation on the "real" rearrangements. Assigning those inferred Rearrangements to another Repertoire would tend to break the whole chain of processing.

Yes, this is particularly tricky and goes beyond just the idea of supporting "simulated" data sets. I'll ponder on this awhile, but my initial thought is that these inferred things need to be in their own "collections" separate from the other data, yet tied to it using an independent identifier.

scharch · 2024-03-04T03:35:25Z

My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

schristley · 2024-03-04T03:47:11Z

> My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

But you still want it to be connected to a real repertoire with the experimental protocol, right? If I'm understanding properly, you are still doing (say) a single-cell experiment, which is described in a Repertoire and which you process into rearrangements/cells, but when you start investigating clones and lineage, you are inferring new sequences?

That's slightly different from a simulated dataset, where essentially everything is "fake".

scharch · 2024-03-04T03:52:19Z

Yes and yes. So unlike a simulated repertoire, those fields wouldn't be nulled.

schristley · 2024-03-04T14:14:37Z

There is an "easy" solution but it unfortunately creates significant churn. That is, add an identifier. Just like repertoire_id partitions rearrangements between repertoires, and then data_processing_id at the next level to partition rearrangements within the same repertoire for different data processings, you could add an identifier at a third level that further partitions between real vs inferred. The problem is that implies significant change across the whole tool chain, i.e. the ADC API and analysis tools, which have repertoire_id and data_processing_id baked into their code. All that would need to be rewritten to support a third identifier. So scratch that off the list.

We want to avoid breaking the existing tool chain, so that implies that the inferred rearrangements/cells need to have a different repertoire_id.
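A rough sketch of why a separate repertoire_id keeps existing tools working: a filter that only knows about repertoire_id and data_processing_id never sees the inferred records. The record layout, the `select` helper, and the `rep_1_inferred` identifier are all assumptions made for illustration, not part of the ADC API.

```python
rows = [
    {"sequence_id": "seq_1", "repertoire_id": "rep_1", "data_processing_id": "dp_1"},
    {"sequence_id": "seq_2", "repertoire_id": "rep_1", "data_processing_id": "dp_1"},
    # Inferred ancestor stored under a separate, hypothetical repertoire_id.
    {"sequence_id": "anc_1", "repertoire_id": "rep_1_inferred", "data_processing_id": "dp_1"},
]


def select(rows, repertoire_id, data_processing_id):
    """Two-level filter of the kind existing tools are described as using."""
    return [
        r for r in rows
        if r["repertoire_id"] == repertoire_id
        and r["data_processing_id"] == data_processing_id
    ]


# The unchanged filter returns only the observed rearrangements.
real = select(rows, "rep_1", "dp_1")
inferred = select(rows, "rep_1_inferred", "dp_1")
```

No third identifier is needed here, which is the point of the approach; the cost is that something must still link `rep_1_inferred` back to `rep_1`.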

scharch · 2024-03-04T14:25:57Z

> you could add an identifier at a third level that further partitions between real vs inferred

I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.
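As an illustration of the ambiguity, consider a clone whose observed members come from several repertoires in one group. The repertoire names and the record layout below are made up for the example.

```python
# Observed members of one clone, inferred across a hypothetical
# RepertoireGroup (time course plus a second tissue).
members = [
    {"sequence_id": "seq_1", "repertoire_id": "rep_blood_t0"},
    {"sequence_id": "seq_2", "repertoire_id": "rep_blood_t1"},
    {"sequence_id": "seq_3", "repertoire_id": "rep_lymph_t1"},
]

# Every repertoire that contributed a member is an equally plausible
# home for an inferred ancestor of this clone.
candidate_homes = sorted({m["repertoire_id"] for m in members})
is_ambiguous = len(candidate_homes) > 1
```

With three candidate repertoires there is no principled way to pick one for the inferred sequence, regardless of how many identifier levels exist.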

schristley · 2024-03-04T15:49:21Z

> > you could add an identifier at a third level that further partitions between real vs inferred
>
> I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.

Ok, I missed that. So I guess when you say RepertoireGroup, you mean that you have multiple repertoires for a subject, e.g. time course or different tissues or such, and you want to combine them together when doing the clonal inference? Makes sense to me.

This was referenced Feb 29, 2024:

Re-visit implementation decisions around Cell object #768 (Closed)

Extend Clone to single-cell context #317 (Open)

scharch mentioned this issue Mar 21, 2024:

Clone-schema-updates #778 (Draft, 6 tasks)
