
more documentation for duplicate_count #161

Closed
schristley opened this issue Oct 24, 2018 · 36 comments · Fixed by #586

Comments

@schristley (Member) commented Oct 24, 2018

From a comment by Mathieu of vidjil group:

In particular, we used the duplicate_count key to describe the number of reads gathered in each clone, as you suggested to us in your previous email. To enable better interoperability, maybe this suggestion could be clearly stated in the AIRR documentation of this field? For example, taking inspiration from what is already in the sequence_id field: "duplicate_count (...) This may also be a total number of gathered sequences in cases where query sequences have been combined in some fashion by the tool."

@bussec (Member) commented Oct 26, 2018

Agreed, we need to describe this in more detail. To make sure we are all on the same page, my interpretation of what the docs should say is:

  • consensus_count indicates the total number of raw reads for a given sequence.
  • duplicate_count indicates the number of UMI-collapsed reads for a given sequence.
  • The counts in consensus_count and duplicate_count should be identical for non-UMI protocols.
  • duplicate_count maps to the MiAIRR field "Read count", which in turn maps to the custom AIRR_READ_COUNT keyword for Genbank annotation.

Is this correct?

@schristley (Member Author)

Is this correct?

Actually I think it is switched around based upon what Jason has mentioned before.

  • consensus_count is the number of UMI-collapsed reads for a given sequence. That is, UMI is used to build a "consensus" sequence.
  • duplicate_count is the total number of raw reads for a given sequence. Or really in general, any collapsing of reads/sequences that are considered "duplicates".

However, duplicate_count can be a bit fuzzy. For example, some tools that pre-process raw reads, before V(D)J assignment is performed, will collapse identical sequences (same length and nucleotides) into a single sequence with a duplicate_count. Other tools don't collapse on sequence but on clones, and duplicate_count is the number of sequences/reads in the clone. There is also the scenario where two or more sequences have different UMI codes but are otherwise identical, and thus can be collapsed as duplicates.

In the first case of pre-processing, you can imagine that there can be two or more rows of sequences (that differ slightly in length or nucleotides), each with their own duplicate_count value, but all of those rows belong to the same clone. To get a true count for the clone, you would need to sum up the duplicate_count for all of those rows.
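The summation described above could be sketched as follows. This is a hypothetical illustration (the row values and clone_id assignments are made up, not from any real dataset): per-row duplicate_count values belonging to the same clone are summed to recover a total count for that clone.

```python
# Hypothetical illustration: summing per-row duplicate_count values to
# recover a total count per clone, when a pre-processing tool has
# collapsed identical sequences into separate rows of the same clone.
rows = [
    {"sequence_id": "seq1", "clone_id": "clone1", "duplicate_count": 10},
    {"sequence_id": "seq2", "clone_id": "clone1", "duplicate_count": 7},
    {"sequence_id": "seq3", "clone_id": "clone2", "duplicate_count": 3},
]

clone_totals = {}
for row in rows:
    clone_totals[row["clone_id"]] = (
        clone_totals.get(row["clone_id"], 0) + row["duplicate_count"]
    )

print(clone_totals)  # {'clone1': 17, 'clone2': 3}
```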

Maybe this ambiguity is not good, and we should have a separate clone_count field?

I think the intent of duplicate_count is for collapsing sequences that are considered "identical" (accounting for errors and such) without considering the biological interpretation of the sequence, i.e. "identical" just from an informatic perspective.

@javh (Contributor) commented Oct 27, 2018

Hrm. I think you two (@schristley and @bussec) might be saying the same thing? Not sure. I agree with the definitions listed by @bussec... but I suspect the meaning of "UMI-collapsed reads" is causing confusion.

First, I think we should avoid the word "clone" in this discussion, because its meaning is ambiguous. In a BCR context it usually means a collection of related variants, but in a TCR context it usually means a collection of identical sequences. Same thing - all sequences assumed to represent the same original V(D)J recombination event. But, they are counted in a different way. I think Mathieu means the TCR definition.

I see the fields as:

consensus_count: The "confidence level" of the sequence. The number of reads used to build a consensus sequence. I only use this when the count is purely technical with no real biological meaning. Usually, this would be the number of reads contributing to a UMI consensus sequence. But, it could also be from some other error correction approach, such as clustering sequences inferred to differ only by sequencing error and aggregating them somehow (see IgReC for an example).

duplicate_count: The "copy number" of the sequence. This could be the count of UMIs for each unique sequence (as measured by the number of identical copies) or identical sequence counts for non-UMI protocols. I'm pretty certain the latter is exactly what the Vidjil group wants.

So... I think that's the same thing you're both saying? Example:

After UMI consensus generation, but before duplicate removal:

sequence_id  sequence  consensus_count
UMI1         AAAAA     4
UMI2         AAAAA     3
UMI3         CCCCC     2

After duplicate removal:

sequence_id  sequence  consensus_count  duplicate_count
UMI1+2       AAAAA     7                2
UMI3         CCCCC     2                1
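The duplicate-removal step in this example could be sketched as follows (a minimal illustration, not any tool's actual implementation): identical sequences are merged, consensus_count values are summed, and duplicate_count records how many UMI consensus sequences were merged.

```python
# Sketch of the duplicate-removal step: merge identical sequences,
# sum consensus_count, and count merged records as duplicate_count.
records = [
    {"sequence_id": "UMI1", "sequence": "AAAAA", "consensus_count": 4},
    {"sequence_id": "UMI2", "sequence": "AAAAA", "consensus_count": 3},
    {"sequence_id": "UMI3", "sequence": "CCCCC", "consensus_count": 2},
]

collapsed = {}
for rec in records:
    entry = collapsed.setdefault(
        rec["sequence"],
        {"sequence_ids": [], "consensus_count": 0, "duplicate_count": 0},
    )
    entry["sequence_ids"].append(rec["sequence_id"])
    entry["consensus_count"] += rec["consensus_count"]
    entry["duplicate_count"] += 1

for seq, entry in collapsed.items():
    print("+".join(entry["sequence_ids"]), seq,
          entry["consensus_count"], entry["duplicate_count"])
# UMI1+UMI2 AAAAA 7 2
# UMI3 CCCCC 2 1
```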

For non-UMI protocols, I just wouldn't include consensus_count. Though, I think making them identical, as @bussec says, also makes sense. As for AIRR_READ_COUNT, I believe this is supposed to be consensus_count, because it was intended to reflect the confidence level for the submitted sequence.

I think we should wait on defining a clone_count field that would specify the count of clonally related variants. That sounds like part of the lineage schema.

@scharch (Contributor) commented Oct 29, 2018

In SONAR, I have implemented duplicate_count as the number of pre-VDJ-assignment reads with exact length and nucleotide sequence matches. Because we don't use UMIs, SONAR does error handling by clustering reads at the expected error threshold, and I implemented a custom cluster_count to hold the number of reads that go into each cluster, sort of the opposite of the way @javh defined consensus_count vs duplicate_count above. I've also put in a clone_count field, though I agree that more properly belongs in a lineage schema.

@bcorrie (Contributor) commented Dec 13, 2018

FYI, we just ran into this in a data set provided by a collaborator...

MiXCR's clone assignment process is here: https://mixcr.readthedocs.io/en/master/assemble.html

The question is, to which MiAIRR/DataRep field does this get assigned?

@javh (Contributor) commented Dec 14, 2018

@bcorrie, which part of the MiXCR output are you referring to? Depending upon which value it is, it could be consensus_count, duplicate_count, or something like the clone_count field @scharch added. Looks like they have several aggregating steps (raw reads, identical junction, V/J/C gene).

(Pinging @mikessh and @dbolotin.)

@bcorrie (Contributor) commented Dec 14, 2018

Sorry, in this case it is the cloneCount field from the MiXCR annotation output. Which you might say obviously should be mapped to clone_count, but given the above discussion I was not clear. And I suspect that given that MiXCR is a commonly used annotation tool, I suspect that we might want to make sure we are clear as to what this mapping should be.

@emilyvcbarr or @nishanth would be able to comment further on which MiXCR processing steps that produce this output file.

@bcorrie (Contributor) commented Mar 6, 2019

Hello all, just following up on this issue - hoping to maybe close it out... Any input as to the mappings from MiXCR clone count fields to the AIRR fields?

@bcorrie (Contributor) commented Apr 15, 2019

We discussed duplicate_count and consensus_count a bit more at MinStd this week.

The suggestion was that for each rearrangement record, duplicate_count and consensus_count may or may not exist. The meaning of these fields depends on the parameters in Section 3 and Section 5 of MiAIRR (PCR prep and Data Processing), and the actual meaning and interpretation of these counts would likely be ill-defined without considering that higher-level metadata.

That is, a researcher would need to know information from Sec 3 and Sec 5 in order to interpret (and more importantly compare/analyze) the consensus_count and duplicate_count provided in two data sets (repertoires) from two different studies.

This means that when providing simple summary statistics (counts of rearrangements) about aggregate data across many repertoires, it is problematic (and possibly misleading) to aggregate the consensus_count and duplicate_count because of the interpretation required.

Does my interpretation make sense given the above discussion?

@javh (Contributor) commented Apr 15, 2019

Yeah, I think that's accurate. This is further confounded in the case of single-cell data with contig assembly. For example, 10X V(D)J will have reads that went into building the contig and UMI count (expression level). Because I don't have a use for the contig read count, I've been using consensus_count for the UMI count and duplicate_count for duplicate V(D)J sequences (as normal).

But, the semantics are misleading in that case, because in bulk UMI projects, the consensus_count will be the count of assembled reads and the duplicate_count will be the UMI count...

I'm not sure exactly what to do about it, but we should maybe bring this topic up in the DR-WG? I'm starting to think we should maybe have count_* fields instead (count_duplicate, count_read, count_umi, count_clone). Not sure. It's very difficult, maybe impractical, to support multiple different experimental protocols and analysis protocols with two fields.

@schristley (Member Author)

Is it possible to separate out the case where a field unambiguously acts as a multiplier for abundance for the rearrangement (versus for clones)? That strikes me as the most important to be unambiguous. If programs need to perform some sort of logic to determine how a field is used, that can be error prone.

@bcorrie (Contributor) commented Apr 15, 2019

A thought on this as someone who is NOT a domain expert.

It seems to me like we are starting to get into "analysis". Admittedly very simple analysis, but analysis just the same... As soon as you start aggregating things with a non-identity (or not precisely defined) equality test, the test you use for equality is open to interpretation. More importantly, for someone to use your aggregation, they will need a description of how the data was aggregated so they can decide whether it makes sense for them to use the aggregation in comparisons.

I suppose I would argue that if we are talking about data sharing, and in particular comparing two data sets, having an aggregation that can't be used across data sets isn't particularly useful from an AIRR/MiAIRR data sharing perspective. Certainly important from an analysis perspective, but...

UMI/consensus_count and exact sequence/duplicate_count (equality) seem to me to be well defined. I was hoping that we could define them exactly, but it seems like this is not the case. At the same time, it seems like the confusing thing is not so much that we can't come up with a well defined definition, but it is more that the definition would not be adequate for many different cases (the definition would be too limiting?).

Would it make sense to try to define these precisely (so there are a small number of well defined aggregation metrics), recognizing that if other aggregation metrics are required (other count_* fields) to capture other things there would be other mechanisms to define and record such counts? It seems like if we do define such counts, we also need a mechanism to describe how the aggregation was done...

I would argue that such aggregation metrics (the count_* fields) are definitely starting to go down the analysis path, and it seems to me that we should be separating the storing of the fundamental annotation data from the analysis artifacts that are created for that data.

I suppose what I am wondering is whether analysis results (including aggregations that are loosely defined) should be treated as metadata that is attached to the "basic" or "fundamental" annotation data that we currently have in our existing AIRR DataRep format???

@schristley (Member Author)

It seems to me like we are starting to get into "analysis".

This isn't problematic to me. In my mind, "pre-processing" of the raw data is "analysis" because you have to make decisions about quality, and what to include/exclude with your filters. Though I do agree that leaving a field open for interpretation is best avoided.

definition would not be adequate for many different cases (the definition would be too limiting?)

Are these different cases listed anywhere? I can't think off the top of my head what they would be. We've done this before, so it's reasonable that we don't try to represent everything.

Related to this is the issue that some experimental protocols aren't "quantitative" in the sense that any counts or quantities don't directly relate to the number of receptors/cells in the biology. They might be counting something else like the read coverage, or the number of RNA molecules for example. Comparisons with other data might produce incorrect results.

@bcorrie (Contributor) commented Apr 18, 2019

It seems to me like we are starting to get into "analysis".

This isn't problematic to me. In my mind, "pre-processing" of the raw data is "analysis" because you have to make decisions about quality, and what to include/exclude with your filters.

I think that is the issue I raise... We are already capturing analysis steps, and I think we are going to hit a brick wall soon because of the explosion of possible different analysis steps.

For the pre-processing you speak of (which we can consider analysis) we have an explicit, well defined process (quality control) and a metadata section that captures the parameters used for that process (e.g. the metadata we capture in SequencingRun and/or DataProcessing) to produce some data. We have been able to capture that because we think it is well defined enough to capture.

This is similar if you consider the processing that is done for annotation in that we have a metadata section that describes how that was done (e.g. in DataProcessing).

In this instance we are talking about collapsing sequences that have been annotated. My argument is that if we can define this precisely, then we can capture it in our current metadata. If duplicate_count is defined to be an exact sequence match, then we make it clear that that is what we mean by duplicate_count and we don't need to do any more and we can use duplicate_count as is.

If we are considering other types of processing/analysis (which we should be) such as collapsing sequence reads using non precise or fuzzy equality metrics or clone counts that produce other analysis artifacts (e.g. @javh count_clone or my new count_duplicate_fuzzy) then we need a metadata section that describes how the process (e.g. clone construction or fuzzy duplicate count) was performed and the parameters used for that process (e.g. number of NT that can differ in my fuzzy duplicate count match).

In the general case, we need a mechanism that links 1) an analysis step (an algorithm) with 2) a set of parameters used in the analysis step with 3) a set of analysis artifacts produced using a specific combination of 1) and 2). This is basically a mechanism to incrementally add analysis layers to a set of data, in principle adding more and more complexity to the analysis.

My understanding is that because the domain starts to get so rich and complex at this point (everyone has their own favorite tool or technique), a general, flexible, and extensible mechanism is required to capture all of the cases. It seems like now might be the time to consider a general solution, as it doesn't seem practical to keep extending well defined analysis parameters - because they aren't well defined any more.

@bcorrie (Contributor) commented Apr 18, 2019

definition would not be adequate for many different cases (the definition would be too limiting?)

Are these different cases listed anywhere? I cannot think of the top of head what they would be. We've done this before so it's reasonable that we don't try to represent everything.

The cases that I was talking about were what @javh mentioned (below):

I'm not sure exactly what to do about it, but we should maybe bring this topic up in the DR-WG? I'm starting to think we should maybe have count_* fields instead (count_duplicate, count_read, count_umi, count_clone). Not sure. It's very difficulty, maybe impractical, to support multiple different experimental protocols and analysis protocols with two fields.

There are many count_* fields that one could imagine, @javh mentioned four here. I am suggesting that for any count_* field that isn't well defined (is algorithmic) for it to be useful you need to be able to describe how it was calculated so that users of the data know whether things are comparable or not. count_clone is probably the best example, because it probably has a wide variety of viewpoints on what it means.

Related to this is the issue that some experimental protocols aren't "quantitative" in the sense that any counts or quantities don't directly relate to the number of receptors/cells in the biology. They might be counting something else like the read coverage, or the number of RNA molecules for example. Comparisons with other data might produce incorrect results.

I think we are agreeing 8-) I am suggesting that in this more general analysis world we need a general mechanism for linking emerging "DataRep fields" (e.g. count_clone) with an analysis description of how they were calculated (my clones are exact CDR3 matches) so that a researcher can decide whether it makes sense to compare clone counts across two different analyses. I think we want to do that without pre-defining the count_* fields (or all the other analysis fields that will arise). If a researcher can't compare two data sets because the clone counts are not comparable, then they either need access to the underlying data so that they can generate the "desired" count_clone to use for comparison, or they won't be able to use that data for their comparison.

As you suggest, what we definitely want to avoid is having someone compare count_clone fields between two data sets because they exist when the mechanism for determining count_clone in the two data sets was not comparable...

@scharch (Contributor) commented Apr 22, 2019

we need a metadata section that describes how the process (e.g. clone construction or fuzzy duplicate count) was performed and the parameters used for that process

Yes, but isn't this already taken care of by including software versions and parameters/command lines in the metadata? I think it's fine to expect users to look up how ChangeO defines clones versus SONAR, instead of requiring that to be explicitly repeated in the metadata. In fact, as we get further down the pathway of "analysis," the metadata would explode if the details of every step had to be spelled out.

Certainly there are exceptions to this, like undocumented tools. But in that case, what are the odds that the person using that tool is going to back-fill the metadata, anyway? Maybe there needs to be a field for cases of manual curation, but I would expect those to be fairly rare.

@schristley (Member Author)

Yes, but isn't this already taken care of by including software versions and parameters/command lines in the metadata?

I agree with this. This can be formalized more but I think there are diminishing returns for the usefulness of that metadata, and IMO this gets beyond the scope of DRWG.

As you suggest, what we definitely want to avoid is having someone compare count_clone fields between two data sets because they exist when the mechanism for determining count_clone in the two data sets was not comparable

Yes true but also no. From a scientific standpoint these comparisons are done all the time as part of data exploration, comparison of methods and so forth. If count_clone has the same definition but different calculation method for those two data sets, then I think it is quite reasonable to compare them. For example, people do this for v_call.

Regardless, I understand your point: if you want to show some sort of graph, which uses abundance numbers, is there any way to ensure those numbers are comparable across studies? I think this can be done for quantitative experimental protocols with some well-defined fields. What I worry about are the non-quantitative protocols, which can still be used for comparison, but they might need to be normalized in a certain way to make that comparison valid or useful.

It seems to me that the fields duplicate_count, consensus_count and (maybe) clone_count can be given strict definitions, and I'm fine with renaming them as @javh suggests if that helps.

  • duplicate_count is exact sequence match (equality).

  • consensus_count or count_umi is number of reads (or sequences?) used to build a consensus sequence. While this might be interpreted as a "confidence level", I would suggest that another field might be better for that to support other forms of "confidence", like a probability value.

  • clone_count is the number of ... should it be the informatic "clonotype" or the biological "clone". Some experimental protocols do allow the biological clone number to be estimated. Maybe we should split this into two fields by adding count_clonotype field?

@javh (Contributor) commented May 17, 2019

I was just chatting with @psathyrella and we might have another problem with the duplicate_count field. The naming is somewhat misleading in that it implies that it's the count of additional duplicate observations (copy number minus 1) rather than the copy number.

copy_count might be better semantics.

@schristley (Member Author)

Interesting, it never occurred to me to interpret duplicate_count that way. I don't prefer copy_count though because that terminology is frequently used in genomic studies, e.g. the copy count of a gene, and this might lead to confusion down the road. How about some more literal names like number_of_exact_sequences, exact_sequence_count, total_equal_sequences, ... ?

@javh (Contributor) commented May 20, 2019

That's a good point about copy_count. Honestly, I think our stated intent to prioritize backwards compatibility probably means we shouldn't change the name of duplicate_count in favor of just having better documentation (including not reversing the names to count_*).

@scharch (Contributor) commented May 21, 2019

I agree, better not to change the field name at this point.

@psathyrella

I wasn't actually confused because of the name -- it's somewhat ambiguous, but I think most choices also would be. I was confused because to me in the description:

Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs.

"copy number" sounded like the total number of copies, i.e. including the "main" sequence, while "number of duplicate observations" sounded like it excluded it. I think there's other ways to read this description so it's not contradictory, but either way it seems easy to just make it super unambiguous.

@schristley (Member Author)

@javh This has come up again as I've been processing single cell datasets. 10x is putting a read count in duplicate_count, which is confusing compared to other usages where it indicates a count of duplicate sequences. If the duplicate_count for 10x data gets used as an abundance then it gives confusing results.

For example, say I do genomic DNA bulk sequencing, with little to no amplification, the number of sequences is pretty close to the number of genomes. duplicate_count is then a measure of abundance. Am I incorrect in this usage? Is the 10x using the duplicate_count wrong? Are we both using it correctly but the semantics of duplicate_count is overloaded?

@javh (Contributor) commented Aug 16, 2021

@schristley, I think 10x is using it correctly and the semantics of duplicate_count is overloaded.

They are using duplicate_count for the transcript UMI count and consensus_count for the raw read count for the contig assembly, which are consistent with how they are used for bulk protocols with UMIs. These map to the umis and reads columns in the cellranger contig annotations csv, respectively.

In changeo, we've been using the custom field umi_count for the 10x UMI count to avoid the problem of ambiguous interpretation of duplicate_count. We've left reads as consensus_count.
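The cellranger-to-AIRR mapping described above could be sketched as follows. This is an illustrative snippet, not cellranger's or changeo's actual code; the input dict and the function name map_10x_counts are made up for the example.

```python
# Hypothetical mapping from a 10x cellranger contig annotation record
# to AIRR Rearrangement count fields, per the description above:
# the `umis` column becomes duplicate_count (transcript UMI count) and
# the `reads` column becomes consensus_count (reads in the contig assembly).
def map_10x_counts(contig_row):
    return {
        "duplicate_count": contig_row["umis"],
        "consensus_count": contig_row["reads"],
    }

print(map_10x_counts({"umis": 2, "reads": 8}))
# {'duplicate_count': 2, 'consensus_count': 8}
```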

If we had it to do over again, and maybe we should, something like this seems more compatible with the current menu of protocols:

  • read_count - Raw read count.
  • umi_count - UMI count
  • copy_count - number of duplicate observations.
  • clone_count - number of observations within a given clone_id.

The problem with the above is that fields have overlapping semantics for some protocols. umi_count and copy_count are confusingly similar for 5'RACE bulk sequencing; copy_count and clone_count are confusingly similar for gDNA TCR bulk sequencing. But, maybe that's better than insufficiently descriptive for single-cell.

@schristley (Member Author)

Can we keep copy_count as duplicate_count or do you think it's better to rename and deprecate to "enforce" the change? I'm still a little hesitant about copy_count as it has semantics with gene copy number in my head.

The problem with the above is that fields have overlapping semantics for some protocols. umi_count and copy_count are confusingly similar for 5'RACE bulk sequencing;

I know of some 5' RACE that doesn't use UMIs, even though they probably should...

copy_count and clone_count are confusingly similar for gDNA TCR bulk sequencing. But, maybe that's better than insufficiently descriptive for single-cell.

This might actually be ok as it could be a quick qualitative check about the clonal assignment process.

@javh (Contributor) commented Aug 16, 2021

Can we keep copy_count as duplicate_count or do you think it's better to rename and deprecate to "enforce" the change? I'm still a little hesitant about copy_count as it has semantics with gene copy number in my head.

Yeah, we could probably skip read_count and copy_count in favor of preserving consensus_count and duplicate_count. But, they do seem to have confusing semantics, e.g. for data with just "raw" reads (no UMIs, no contig assembly). And duplicate_count has that issue of whether it means "total duplicates" or "total duplicates - 1".

@schristley (Member Author)

Yeah, we could probably skip read_count and copy_count in favor of preserving consensus_count and duplicate_count. But, they do seem to have confusing semantics, e.g. for data with just "raw" reads (no UMIs, no contig assembly). And duplicate_count has that issue of whether it means "total duplicates" or "total duplicates - 1".

That's true, read_count sounds less ambiguous. For my code logic, I've been using duplicate_count as the replacement for 1 but I can see thinking about the word "duplicate" could create ambiguity. And partially I'm being selfish because I don't want to change all my codes ;-D

# Treat a missing duplicate_count as a copy number of 1
if row['duplicate_count'] is None:
    num = 1
else:
    num = row['duplicate_count']

Maybe it would help in the documentation to provide invariants?

count(post-processed RAW reads) == count(reads in RAW sequencing file) - count(pre-processed filtered reads)

sum(read_count in AIRR TSV) == count(post-processed RAW reads)

sum(duplicate_count for a sequence in AIRR TSV) == count(reads for a sequence in post-processed RAW reads)
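The invariants above could be checked mechanically along these lines. This is a hedged sketch with made-up numbers; the variable names (raw_reads, reads_per_sequence, etc.) are illustrative, not AIRR field names, and read_count here refers to the proposed rename of consensus_count.

```python
# Hypothetical worked check of the three invariants, using toy numbers
# for a dataset with two unique sequences, seqA and seqB.
raw_reads = 12             # count(reads in RAW sequencing file)
filtered_reads = 2         # reads removed during pre-processing
post_processed_reads = 10  # reads surviving pre-processing

read_counts = [6, 4]                           # read_count column values in the AIRR TSV
duplicate_counts = {"seqA": [6], "seqB": [4]}  # duplicate_count rows per unique sequence
reads_per_sequence = {"seqA": 6, "seqB": 4}    # post-processed reads observed per sequence

assert post_processed_reads == raw_reads - filtered_reads
assert sum(read_counts) == post_processed_reads
for seq, counts in duplicate_counts.items():
    assert sum(counts) == reads_per_sequence[seq]
print("invariants hold")
```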

@bcorrie (Contributor) commented Aug 16, 2021

@kira-neller can you have a look and comment on this - both from your recent 10X experiences as well as how we have curated data in the past.

Does @schristley's algorithm above work for what we do - in particular when we "unroll" based on duplicate count? I think it does...

@schristley what is read_count in your above calculations? There is no such AIRR field. Is that consensus_count?

@schristley (Member Author)

@schristley what is read_count in your above calculations? There is no such AIRR field. Is that consensus_count?

Yes, potential rename of consensus_count to read_count

@scharch (Contributor) commented Aug 16, 2021

Yeah, we could probably skip read_count and copy_count in favor of preserving consensus_count and duplicate_count.

To clarify for those of us who are slow: so, just adding a new umi_count field?

The problem with the above is that fields have overlapping semantics for some protocols. umi_count and copy_count are confusingly similar for 5'RACE bulk sequencing

I would change the Required status so that they are essentially mutually exclusive. Either you have UMI data and use umi_count, or you don't and you use duplicate_count.

copy_count and clone_count are confusingly similar for gDNA TCR bulk sequencing. But, maybe that's better than insufficiently descriptive for single-cell.

This might actually be ok as it could be a quick qualitative check about the clonal assignment process.

Yes, I think this is fine

And duplicate_count has that issue of whether it means "total duplicates" or "total duplicates - 1".

I have trouble getting worked up about this, let's just pick one and then make sure the documentation is clear.

Maybe it would help in the documentation to provide invariants?

count(post-processed RAW reads) == count(reads in RAW sequencing file) - count(pre-processed filtered reads)

sum(read_count in AIRR TSV) == count(post-processed RAW reads)

sum(duplicate_count for a sequence in AIRR TSV) == count(reads for a sequence in post-processed RAW reads)

Took me a couple tries to parse these. I think they're useful, but I would also advocate including something like @javh's demo above (#161 (comment)), because that seems like the easiest way of clarifying things, to me.

@javh (Contributor) commented Aug 16, 2021

That's true, read_count sounds less ambiguous. For my code logic, I've been using duplicate_count as the replacement for 1 but I can see thinking about the word "duplicate" could create ambiguity. And partially I'm being selfish because I don't want to change all my codes ;-D

Well, I'm really averse to breaking backwards compatibility if we can avoid it for the sake of all people's codes. :)

To clarify for those of us who are slow: so, just adding a new umi_count field?

Just adding umi_count and clone_count. We could conceivably add read_count and keep consensus_count, but constrain the meaning of consensus_count specifically to the number of reads used for consensus/assembly.

@scharch (Contributor) commented Aug 16, 2021

Just adding umi_count and clone_count.

I thought we decided up-thread that clone_count belongs in Clone, not Rearrangement?

@javh (Contributor) commented Aug 16, 2021

I thought we decided up-thread that clone_count belongs in Clone, not Rearrangement?

Makes sense to me. We could just put a note in the docs that you can copy clone_count up to Rearrangement if you want to.

@kira-neller commented Aug 16, 2021

@kira-neller can you have a look and comment on this - both from your recent 10X experiences as well as how we have curated data in the past.

Does @schristley's algorithm above work for what we do - in particular when we "unroll" based on duplicate count? I think it does...

@bcorrie per 10X documentation they define the two fields as:

duplicate_count | The number of unique molecular identifiers associated with this rearrangement.
consensus_count | The number of reads associated with this rearrangement.

For 10X data, each sequence in the AIRR.tsv is an assembled contig. Let's say we have a contig sequence Z built from reads associated with 2 different UMIs, A and B. As I understand it, if there are 3 UMI-A reads, they contribute +3 to consensus_count, but only +1 to duplicate_count. If there are 5 UMI-B reads, this adds +5 to consensus_count and +1 to duplicate_count. So for contig Z you end up with consensus_count = 3 + 5 = 8; duplicate_count = 1 + 1 = 2.

When we unroll, it is based on duplicate_count. Per the example above, our unrolled AIRR.tsv would have the annotation for contig Z present twice. I'm not completely sure if this agrees with @schristley's definition above; is an assembled contig from a single-cell equivalent to a "sequence in post-processed RAW reads"?
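The contig Z arithmetic above can be restated in a few lines. A minimal sketch, assuming the reads-per-UMI breakdown given in the example (3 reads for UMI-A, 5 for UMI-B): consensus_count is the total read count across UMIs, and duplicate_count is the number of distinct UMIs.

```python
# Illustrative recomputation of the contig Z example: reads supporting
# the contig are grouped by UMI; consensus_count sums the reads and
# duplicate_count counts the distinct UMIs.
reads_per_umi = {"UMI-A": 3, "UMI-B": 5}

consensus_count = sum(reads_per_umi.values())  # 3 + 5 = 8
duplicate_count = len(reads_per_umi)           # 2 distinct UMIs

print(consensus_count, duplicate_count)  # 8 2
```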

@kira-neller

Hi @schristley @javh @scharch, just pinging you on this issue per yesterday's Standards Call. This is also related to #543

Thanks!

@scharch (Contributor) commented Nov 12, 2021

We didn't really get into this one in the November call, but related discussion at: #543 (comment)

My personal sense, based on both the desire for parallelism to the Clone *_count fields and the discussion above is that we should:

  • add a umi_count field.
  • revise the documentation of duplicate_count to deprecate its use in favor of umi_count where possible. (duplicate_count would remain available for non-UMI contexts.) Per @kira-neller's comment above, this would mean 10x would have to revise/update the output of cellranger.

I think consensus_count looks ok as is and clone_count should not (officially) be part of the Rearrangement schema.
