Add a "nonphysical" keyword to Rearrangement and Cell #769

Open
scharch opened this issue Feb 29, 2024 · 7 comments

Comments

@scharch
Contributor

scharch commented Feb 29, 2024

Closely related to #201, obviously, but I'm actually thinking more about #317 and efforts to simplify the Clone schema. For #201, all Rearrangements/Cells in a Repertoire would be nonphysical, which is why I suggested a Repertoire-level is_simulated keyword.

However, in the Clone space we have inferred intermediates/ancestors, which I guess would either be part of the same Repertoire as the observed Rearrangements/Cells they are based on, or maybe not part of a Repertoire at all.

Currently, we handle this by siloing them into the Clone schema, either directly in Clone (using fields like v_call, germline_alignment, etc.) or by converting them into Node objects (which in turn requires Tree to be an object instead of just a field). That's what's making #317 hard: we've set up Clone and Node to mimic Rearrangements, and now we also want them to be able to mimic Cells.

If we instead store the inferred intermediates/ancestors as bona fide but nonphysical Rearrangements/Cells, then Clone can just have a generic array of members and the problem goes away. So crazy it just might work?
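A rough sketch of what that could look like, purely for illustration. The `nonphysical` flag and the `members` list are hypothetical names taken from this proposal, not fields in the current schema; `sequence_id`, `sequence`, `v_call`, and `clone_id` are existing field names, and the sequences are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rearrangement:
    sequence_id: str
    sequence: str
    v_call: Optional[str] = None
    # Hypothetical flag proposed in this issue; not part of the current schema.
    nonphysical: bool = False


@dataclass
class Clone:
    clone_id: str
    # Hypothetical generic member list: observed and inferred records side by side.
    members: List[Rearrangement] = field(default_factory=list)

    def observed(self) -> List[Rearrangement]:
        """Members that were actually sequenced."""
        return [m for m in self.members if not m.nonphysical]

    def inferred(self) -> List[Rearrangement]:
        """Members that were inferred (intermediates/ancestors)."""
        return [m for m in self.members if m.nonphysical]


clone = Clone(
    clone_id="clone_1",
    members=[
        Rearrangement("seq_1", "CAGGTGCAGCTG", v_call="IGHV1-2*02"),
        Rearrangement("seq_2", "CAGGTGCAGCTC", v_call="IGHV1-2*02"),
        # Inferred ancestor stored as an ordinary record, flagged nonphysical.
        Rearrangement("anc_1", "CAGGTGCAGCTN", v_call="IGHV1-2*02", nonphysical=True),
    ],
)

observed_ids = [m.sequence_id for m in clone.observed()]
inferred_ids = [m.sequence_id for m in clone.inferred()]
print(observed_ids, inferred_ids)
```

With this shape, Clone no longer needs its own copies of the Rearrangement fields; the same member list could in principle hold Cells as well, though that part is not sketched here.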

@schristley
Member

Ah, okay, I understand better what you are saying now. Creating inferred Rearrangements/Cells would be a nice way to re-use the schema; yes, so crazy it might work! However, it creates a situation in which a fake Repertoire needs to be created to hold it all together. But even that is not right: presumably you do have a real Repertoire for the data, but while doing DataProcessing you are creating inferred Rearrangements, and you don't want them to be accidentally included in other computations on the "real" rearrangements. Assigning those inferred Rearrangements to another Repertoire would tend to break the whole chain of processing.

Yes, this is particularly tricky and goes beyond just supporting "simulated" data sets. I'll ponder this for a while, but my initial thought is that these inferred things need to be in their own "collections", separate from the other data yet tied to it by an independent identifier.

@scharch
Contributor Author

scharch commented Mar 4, 2024

My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

@schristley
Member

> My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

But you still want it to be connected to a real repertoire with the experimental protocol, right? Because if I'm understanding properly, you are still doing (say) a single-cell experiment, which is described in a Repertoire and which you process into rearrangements/cells, but when you start investigating clones and lineages, you are inferring new sequences?

That's slightly different from a simulated dataset, where essentially everything is "fake".

@scharch
Contributor Author

scharch commented Mar 4, 2024

Yes and yes. So unlike a simulated repertoire, those fields wouldn't be nulled.

@schristley
Member

There is an "easy" solution but it unfortunately creates significant churn. That is, add an identifier. Just like repertoire_id partitions rearrangements between repertoires, and then data_processing_id at the next level to partition rearrangements within the same repertoire for different data processings, you could add an identifier at a third level that further partitions between real vs inferred. The problem is that implies significant change across the whole tool chain, i.e. the ADC API and analysis tools, which have repertoire_id and data_processing_id baked into their code. All that would need to be rewritten to support a third identifier. So scratch that off the list.

We want to avoid breaking the existing tool chain, so that implies that the inferred rearrangements/cells need to have a different repertoire_id.
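To illustrate the point, here is a minimal sketch of the two-level partitioning that existing tools rely on, and of how inferred records stay out of the way if they carry a different repertoire_id. The `select` helper, the record values, and the `rep_A_inferred` identifier are all made up for this example; this is not how the ADC API or any particular tool is implemented.

```python
from typing import Dict, List

Record = Dict[str, str]

rearrangements: List[Record] = [
    {"sequence_id": "seq_1", "repertoire_id": "rep_A", "data_processing_id": "dp_1"},
    {"sequence_id": "seq_2", "repertoire_id": "rep_A", "data_processing_id": "dp_2"},
    # Inferred ancestor filed under a separate, hypothetical repertoire_id.
    {"sequence_id": "anc_1", "repertoire_id": "rep_A_inferred", "data_processing_id": "dp_1"},
]


def select(records: List[Record], repertoire_id: str, data_processing_id: str) -> List[Record]:
    """Two-level filter: first by repertoire, then by data processing."""
    return [
        r
        for r in records
        if r["repertoire_id"] == repertoire_id
        and r["data_processing_id"] == data_processing_id
    ]


# An unchanged tool asking for the real data never sees the inferred record.
real_ids = [r["sequence_id"] for r in select(rearrangements, "rep_A", "dp_1")]
inferred_ids = [r["sequence_id"] for r in select(rearrangements, "rep_A_inferred", "dp_1")]
print(real_ids, inferred_ids)
```

The filter itself needs no third key, which is the appeal: code that only knows about repertoire_id and data_processing_id keeps working unchanged.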

@scharch
Contributor Author

scharch commented Mar 4, 2024

> you could add an identifier at a third level that further partitions between real vs inferred

I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.
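A toy example of the ambiguity, assuming a clone whose members come from two repertoires in the same RepertoireGroup (say, two time points). The identifiers and the `source_repertoires` helper are invented for illustration.

```python
from typing import Dict, List

# Observed members of one clone, drawn from two repertoires in a group.
clone_members: List[Dict[str, str]] = [
    {"sequence_id": "seq_1", "repertoire_id": "rep_week0"},
    {"sequence_id": "seq_2", "repertoire_id": "rep_week4"},
    {"sequence_id": "seq_3", "repertoire_id": "rep_week4"},
]


def source_repertoires(members: List[Dict[str, str]]) -> List[str]:
    """Distinct repertoires that contributed observed members to the clone."""
    return sorted({m["repertoire_id"] for m in members})


sources = source_repertoires(clone_members)
# More than one source means there is no single obvious home for an
# inferred ancestor of these sequences.
ambiguous = len(sources) > 1
print(sources, ambiguous)
```

An ancestor inferred from all three sequences descends from neither repertoire alone, so any single repertoire_id assigned to it would be arbitrary.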

@schristley
Member

> > you could add an identifier at a third level that further partitions between real vs inferred
>
> I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.

Ok, I missed that. So I guess when you say RepertoireGroup, you mean that you have multiple repertoires for a subject, e.g. time course or different tissues or such, and you want to combine them together when doing the clonal inference? Makes sense to me.

@scharch mentioned this issue Mar 21, 2024