Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructuring of Cell #776

Merged
merged 8 commits into from Apr 8, 2024
Merged

Restructuring of Cell #776

merged 8 commits into from Apr 8, 2024

Conversation

javh
Copy link
Contributor

@javh javh commented Mar 6, 2024

Initial changes to Cell, CellExpression and CellReactivity per March 4, 2024 call:

  • Remove Rearrangement and expression links in Cell.
  • Renamed CellExpression to Expression.
  • Renamed CellReactivity to Reactivity.

Also:

  • Added cell_subset, cell_phenotype, and cell_label for CL term, marker definition, and free text annotation, respectively.
  • Not sure I like these names, but I stuck with what was already in the CellProcessing object.
  • Not sure we need cell_phenotype, but I thought I'd include it for discussion purposes.

I only modified the main v2 spec. I will propagate once we agree on the changes.

Closes #768
Closes #477

@javh javh added this to the AIRR 2.0 milestone Mar 6, 2024
@javh javh requested review from bcorrie, scharch and bussec March 6, 2024 01:08
@javh javh added this to To do in Single Cell via automation Mar 6, 2024
@scharch
Copy link
Contributor

scharch commented Mar 6, 2024

Looks fine. Will need docs updates, at minimum for the name of Expression and Reactivity. But I think we should also add explicit guidance that says either "put RNAseq data in Expression" or explain how external data (eg GEO) can be linked at the Sample level.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 6, 2024

Yes, looks good to me as well... Will hold off on review check until final is ready, but I think safe to sync the specs.

Single Cell automation moved this from To do to In progress Mar 25, 2024
@bussec
Copy link
Member

bussec commented Mar 25, 2024

Looks good! Wondering whether we should also introduce a cell_label property into the CellProcessing object...

@javh
Copy link
Contributor Author

javh commented Mar 25, 2024

From the call:

  • Add cell_label to CellProcessing.
  • Set Cell.cell_subset to miairr: important. Verify this is clear in the documentation.
  • Sync specs.
  • Merge.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 29, 2024

Hate to say it, but I am trying to use the spec as it is now, and I see an issue...

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

The AIRR Spec has the V Domain for both chains as required and non-nullable. Is that really what we want. We would have no way of capturing IEDB:184739 in the AIRR Receptor schema if that was the case???

Some IEDB records do have the full domain (e.g. https://www.iedb.org/receptor/47) but this is not normal. Don't we want the Receptor object to be more flexible than that?

@bcorrie
Copy link
Contributor

bcorrie commented Mar 29, 2024

I am OK to make the above a separate issue so we can merge the current PR.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 29, 2024

Asking the question another way. Lets say I analyze a single cell study and I find that some cells match (at some level of matching) a known Receptor in IEDB. Lets say the CDR3 and VJ genes are the same. So I create a Receptor object and link that Cell to that Receptor by adding the receptor_id to the Cell.receptors array.

But I am only allowed to do this if the IEDB entry is a paired chain with full V Domain for both chains. If you look at IEDB and search for paired chain receptors you are limited to 31,346 out of the 189,435 "Receptor like things" in IEDB. I am not 100% sure "paired chain == full v domain for both chains", so this might be limited even further.

This seems like a significant restriction. Should we be relaxing the constraints on the Receptor object?

@scharch
Copy link
Contributor

scharch commented Mar 29, 2024

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

It's a TCR, so the full V domain is implied. I think it's ok.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 29, 2024

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

It's a TCR, so the full V domain is implied. I think it's ok.

Problem is that the receptor_variable_domain_1_aa field is required (as is the domain 2 field), non-nullable, and says:

        receptor_variable_domain_1_aa:
            type: string
            description: >
                Complete amino acid sequence of the mature variable domain of the Ig heavy, TCR beta or TCR delta chain.
                The mature variable domain is defined as encompassing all AA from and including first AA after the the
                signal peptide to and including the last AA that is completely encoded by the J gene.
            example: >
              QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGRIDPNSGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSAVYYCARYDYYGSSYFDYWGQGTTLTVSS
            x-airr:
                nullable: false
                adc-query-support: true

So because we don't have the AA variable domains, you can't create an AIRR standards compliant Receptor object for https://www.iedb.org/receptor/184739

It seems to me that our requirements are to stringent.

@bussec
Copy link
Member

bussec commented Mar 29, 2024

The stringency in the schema is deliberate, as the purpose of Receptor was always to be complementary to the potentially incomplete receptor information in IEDB. People are free to download the IEDB database and all Receptor objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 31, 2024

I understand that use case, but if that is indeed the case, then we do not have a mechanism to link an AIRR entity to a known entity in another repository (e.g. IEDB) unless it meets those strict criteria.

There are 150,000+ such entities in IEDB that contain Receptor -> antigen/epitope reactivity information that we are unable to link to given the above constraints. My understanding of the AIRR Receptor object was that it would help us make such a link. You aren't telling me I have been wrong all this time are you 8-)

@bcorrie
Copy link
Contributor

bcorrie commented Mar 31, 2024

People are free to download the IEDB database and all Receptor objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.

But shouldn't our schema have a mechanism to capture when such an algorithm is used to annotate an AIRR entity (Cell or Rearrangement) with such specificity/reactivity information.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 31, 2024

Suggest we move this discussion to #781 and merge this PR.

@javh
Copy link
Contributor Author

javh commented Apr 8, 2024

There's currently no Cell documentation to verify w.r.t. CL importance, so we'll come back to it.

@javh javh merged commit 50c3ba2 into master Apr 8, 2024
5 checks passed
Single Cell automation moved this from In progress to Done Apr 8, 2024
@javh javh deleted the issue-768 branch April 8, 2024 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Single Cell
  
Done
4 participants