Result of add_to_seurat #208

Closed
yksaito opened this issue Jan 9, 2020 · 9 comments

@yksaito

yksaito commented Jan 9, 2020

Hi,

I performed run and add_to_seurat.
It was successful, but I couldn't understand the columns of the meta.data.
I read https://github.com/broadinstitute/infercnv/wiki/Extracting-features, but I'm not sure what each column represents. Is there any explanation available? Thanks.

 "has_cnv_chr1"                  "has_loss_chr1"                 "has_dupli_chr1"
"proportion_cnv_chr1"          "proportion_loss_chr1"          "proportion_dupli_chr1"
"proportion_scaled_cnv_chr1"    "proportion_scaled_loss_chr1"    "proportion_scaled_dupli_chr1"
...
"top_loss_1"                    "top_loss_2"                   "top_loss_3"
...
@GeorgescuC
Collaborator

Hi @yksaito ,

All of these results are based on the HMM predictions.
The has_cnv/has_loss/has_dupli fields are 0/1 flags indicating whether, respectively, any kind of CNV, a loss, or a duplication is found within the given chr.
proportion_cnv/proportion_loss/proportion_dupli are the proportions of genes in the given chr that are part of, respectively, any CNV, a loss CNV, or a duplication CNV.
The proportion_scaled fields are similar to the proportion fields, but use as a weight whether each lost/duplicated gene is affected by a single copy or by two copies.
top_loss_n/top_dupli_n are the n loss/duplication CNVs that cover the most genes.
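
As a rough illustration of how these column families relate to each other, here is a minimal sketch on a toy data frame. The `meta` data frame is only a stand-in for the Seurat object's meta.data, and all of its values are invented; only the column names come from this thread.

```r
# Toy stand-in for the meta.data produced by add_to_seurat(); values are invented.
meta <- data.frame(
  has_cnv_chr1 = c(1, 0, 1),
  has_loss_chr1 = c(1, 0, 0),
  has_dupli_chr1 = c(0, 0, 1),
  proportion_cnv_chr1 = c(0.30, 0, 0.20),
  proportion_loss_chr1 = c(0.30, 0, 0),
  proportion_dupli_chr1 = c(0, 0, 0.20),
  row.names = c("cell_A", "cell_B", "cell_C")
)

# Group the columns by the feature family encoded in their name prefix.
families <- c("has_cnv", "has_loss", "has_dupli",
              "proportion_cnv", "proportion_loss", "proportion_dupli")
columns_by_family <- lapply(families, function(prefix) {
  grep(paste0("^", prefix, "_"), colnames(meta), value = TRUE)
})
names(columns_by_family) <- families

# In this toy data, a cell has a CNV on chr1 exactly when it has a loss or a duplication there.
consistent <- meta$has_cnv_chr1 == pmax(meta$has_loss_chr1, meta$has_dupli_chr1)
```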

If you need more details about any of them, please let me know which one.

Best,
Christophe.

@kylandra

Hi,
these metadata could be very useful, but I do not understand how to get more information for each top loss / top duplication.
How can I identify the genes that are included in losses or duplications?
thanks

@GeorgescuC
Collaborator

Hi @kylandra ,

At this time this information is not output in a very easy to use text format, but I will add it when I get the time to.

Until then, you can try using the crude debug output that add_to_seurat can output. For that, you need to change the log level by running the following:

library(futile.logger)
flog.threshold(DEBUG)

You can then rerun the add_to_seurat method, and it should output a list of the region names (as defined in the .pred_cnv_regions.dat file) for each of the top CNVs. Losses will be the first set of top CNVs, duplications the second.

Regards,
Christophe.

@etlioglu

Hi Christophe,

Also in line with the original question above:

I have a tumor sample and I am using all non-epithelial cell types as references via the ref_group_names argument of infercnv::CreateInfercnvObject().

When I extract HMM features with add_to_seurat() and then do rowSums (or != 0) on all of the has_cnv_* columns to assign a binary (0 or 1) "malignancy" value to each cell, I observe non-zero values for the cells that I have provided as reference.

                              
                               FALSE TRUE
  B cells                         93    0
  Cytotoxic T cells              204    0
  Endothelial cells              490    0
  Epithelial cells                77  318
  Monocyte/Macrophage            115  336
  Myofibroblast                   19   90
  other                            0    9
  T cells                         39  133
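
For readers who want to reproduce this kind of table, the collapsing step described above can be sketched as follows. This is only an illustration on a toy data frame: `meta`, its `cell_type` column, and all values are invented, and only the has_cnv_* naming comes from add_to_seurat().

```r
# Toy metadata; cell types and values are invented for illustration.
meta <- data.frame(
  cell_type = c("B cells", "B cells", "Epithelial cells", "Epithelial cells"),
  has_cnv_chr1 = c(0, 0, 1, 0),
  has_cnv_chr2 = c(0, 0, 1, 0),
  has_cnv_chr3 = c(0, 0, 0, 0)
)

has_cnv_cols <- grep("^has_cnv_", colnames(meta), value = TRUE)

# A cell is flagged when at least one chromosome carries a predicted CNV.
meta$any_cnv <- rowSums(meta[, has_cnv_cols, drop = FALSE]) != 0

# Cross-tabulate the flag against the cell-type annotation.
flag_by_type <- table(meta$cell_type, meta$any_cnv)
```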

In that respect, I have some questions, could you please comment on these?

  1. How does inferCNV assign a binary score for the has_cnv_* columns? Is there a statistic associated with the confidence around these?
  2. Does my approach of collapsing the info within the has_cnv_* columns make sense, or are there drawbacks that I failed to see?
  3. There is a chance that some of these high CNV scoring cells are mislabeled; however, given the vast difference between the cell types above (and the fact that different methods were used for cell annotation), the probability is not high. What else could result in such behavior? Below, I have pasted the heatmap output, and I can see some "events" in the Monocyte/Macrophage compartment but not really in the T cells, for example.
  4. Regarding the heatmap output, I kind of see two groups within the "Observations": the top two thirds vs. the rest. Despite having run infercnv::run() with analysis_mode="subclusters", there is only one cluster. How can I make this step "more sensitive"?

Thanks a lot for this great tool!

Emre

[Image: heatmap output, Screen Shot 2020-03-18 at 16 28 40]

@GeorgescuC
Collaborator

Hi @etlioglu ,

1-2) This field is very basic. If a CNV of any size is predicted in a given cell/chr, it is assigned 1. It often makes more sense to use the proportion_cnv_* field with a threshold based on what you see in the references if your references are not very "clean" of signal (as seems to be the case for a couple of regions in your figure). We also tend to be wary of results on chr X, Y, and MT.

  3. It is hard to tell without insight into the experiment, but it could be due to experimental artifacts or actual differences between cells of the same type.
    How many genes are kept in this analysis after the initial filtering? It can also sometimes happen that having too many different cell types in a run makes the selection of expressed genes inaccurate, either because some cell types are too few in number so their specific genes get filtered out, or because too many genes they don't express are kept so the signal gets diluted.
    The signature seen on some of the Monocyte/Macrophage group of cells also appears in the cluster of cells at the top of the lower 'half' of the observations (the top of the lower main split), so I would filter those out.

  4. It seems like the output you posted is the final residual expression figure, not the HMM predictions, which are the ones that take the subclusters into account (subclusters are not displayed in the color bars). For the HMM figures, you should look for figures from steps 17 (HMM predictions) and 19 (Bayesian filtered HMM predictions) if you are using the latest version.
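
To make the thresholding suggestion for the proportion_cnv_* fields concrete, here is a minimal sketch on toy data. Several things are assumptions made for illustration only: the `is_reference` column is hypothetical, the values are invented, and taking the maximum over the reference cells as the threshold is just one possible choice (a high quantile would be more tolerant of outliers).

```r
# Toy metadata; is_reference is a hypothetical column marking reference cells.
meta <- data.frame(
  is_reference = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  proportion_cnv_chr1 = c(0.00, 0.05, 0.10, 0.60, 0.02),
  proportion_cnv_chr2 = c(0.02, 0.00, 0.04, 0.30, 0.01),
  proportion_cnv_chrX = c(0.50, 0.40, 0.45, 0.90, 0.50)
)

prop_cols <- grep("^proportion_cnv_", colnames(meta), value = TRUE)

# Drop the chromosomes treated with caution: X, Y and MT.
prop_cols <- prop_cols[!grepl("_chr(X|Y|M|MT)$", prop_cols)]

# Per-chromosome threshold: the highest proportion seen in any reference cell.
thresholds <- sapply(meta[meta$is_reference, prop_cols, drop = FALSE], max)

# Flag a cell when it exceeds the reference-derived threshold on any chromosome.
above <- sweep(as.matrix(meta[, prop_cols, drop = FALSE]), 2, thresholds, ">")
meta$above_reference <- rowSums(above) > 0
```

With a maximum-based threshold, no reference cell is ever flagged by construction, so the threshold choice directly controls how strict the call is.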

Regards,
Christophe.

@AidenSb

AidenSb commented Aug 2, 2021

Hi @GeorgescuC
Would you please explain what exact numbers proportion_dupli/loss is multiplied by to get scaled_dupli/loss?

Please take a look at my results. I can't understand why a dupli value should be multiplied by 0.5. If this is about the loss values, then the feature should not be in proportion_dupli to begin with.

[Image: screenshot of the results referred to above]

Thank you very much for your time.

@GeorgescuC
Collaborator

Hi @AidenSb ,

The multiplier used for scaling each CNV region individually is (number_of_copies_gained_or_lost / 2), where the division by 2 comes from healthy cells being diploid.
Let's take the example where you have a single CNV, which is a gain/duplication CNV that covers 20% (0.2) of a chromosome. In that case, the proportion of the chromosome that is part of a CNV is 0.2. Now for the scaled proportion, we also look at how many copies are gained. If there are 2 extra copies, you have 20% (of that chromosome size) more DNA, but if you only have 1 extra copy, you only have 10% (of that chromosome size) more DNA.
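
The arithmetic of that example can be written out as a short sketch. `scaled_proportion` is a hypothetical helper named here for illustration, not a function from infercnv.

```r
# Hypothetical helper: scaled proportion for a single CNV region.
scaled_proportion <- function(proportion, copies_changed) {
  # The multiplier is the number of copies gained or lost divided by 2
  # (the diploid baseline of healthy cells).
  proportion * copies_changed / 2
}

# A duplication covering 20% of a chromosome:
one_extra_copy <- scaled_proportion(0.2, copies_changed = 1)
two_extra_copies <- scaled_proportion(0.2, copies_changed = 2)
```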

Regards,
Christophe.

@AidenSb

AidenSb commented Aug 2, 2021

Thanks a lot for the explanation, Christophe (@GeorgescuC).
Just to link this to the HMM i6 predictions: in the case of the amplification states (4, 5, 6), these multipliers should be HMM 4: 0.5, HMM 5: 1, and HMM 6: 1.5.
Am I right?

Also, on this wiki page https://github.com/broadinstitute/infercnv/wiki/Extracting-features, you need to change the color limit to (0,1.5), considering the case of a full-chromosome CNV (proportion 1/1) with more than 2 extra copies: 1 * (3/2) = 1.5. This plot was the main reason I had this issue and asked you about it.

Cheers,
Aiden

@GeorgescuC
Collaborator

Hi @AidenSb ,

Yes you are right for the multipliers.
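
As a small sketch of what those confirmed multipliers imply for the color limit, the lookup below covers only the amplification states 4 to 6 discussed in this thread; the variable names are invented for illustration.

```r
# Multipliers for the i6 amplification states, as confirmed in this thread.
amplification_multiplier <- c("4" = 0.5, "5" = 1.0, "6" = 1.5)

# A CNV covering a whole chromosome (proportion 1) in state 6:
full_chr_scaled <- 1 * amplification_multiplier[["6"]]

# This value falls outside a (0,1) color limit.
exceeds_default_limit <- full_chr_scaled > 1
```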

I added a note on the wiki page you mentioned to cover that type of case, but left the default as (0,1), as it remains the most relevant use case, and when there are more than 2 extra copies, their exact number becomes more difficult to distinguish accurately.

Regards,
Christophe.
