Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimate COI using regression #17

Merged
merged 16 commits into from
Feb 24, 2022
Merged

Estimate COI using regression #17

merged 16 commits into from
Feb 24, 2022

Conversation

OJWatson
Copy link
Collaborator

@OJWatson OJWatson commented Feb 17, 2022

Am using this instead of email for thoughts and PR explanation


  1. The current check_freq_method is I think too conservative. E.g. the following sample that based on WSMAF vs PLMAF is almost certainly a COI of 2, but it would fail the check_freq_method as it has too few loci (it has 6550 but the 95% is 6560) - this is an extreme example but there are others i came across where it was say 3000 loci but was clearly COI of 2 but relatedness meant we had fewer loci than expected.

image

So this PR has an environmental variable at the moment to basically skip the check if COIAF_CHECK_FREQ_METHOD is set to FALSE. When we do that, running with no sequence error at all, we get following comparisons (limited to COI < 10 - not many outside range):

image

So basically GATK is actually reducing some of our signal based on the regressed linear model red lines. So I think what you want to do is actually run with all the data (i.e. no GATK filter) and then maybe check how sequence error impacts it (no errror, 0.01, and inferred).

On reflection re code design, I think it is better to remove the environmental variable check but rather than have coiaf return COI of 1, get it instead to execute as normal but then have the note in place. Then the end user can decide if the COI returned by coiaf should be taken on face value or should in fact be 1. As it currently works though we are making too many COI = 2 samples that have some relatedness (which the frequency method is less affected by) return as COI = 1 when if left to coiaf they would return as COI = 2.

Unless you can think of a better way of helping method 2 work out which samples are COI = 1 vs COI = 2 but with relatedness then I can't see a way around this. Maybe one option would be to have anything that Method 1 discrete returns as COI = 1 have then Method 2 return as COI of 1 with a note, rather than the check_freq_method?

  1. The regression functions are all called with use_bins = FALSE in your compute_coi and optimize_coi functions. Have set this to default as they seem to work similarly to the bins approach and much quicker.

  2. Will leave you to review this

  3. I have added functions for checking the incoming data for correct formatting, e.g. removing NAs and assigning coverage. General tip, always try and do these types of data formatting upfront and early in your big functions rather than adding the required bits (e.g. adding coverage) later on. That way if we need to later on do other checks or add new columns we know exactly where to go to.

  4. P.S. Fairly sure that now that Method 2 is working correctly, that our idea of Method 2 - Method 1 being an indicator of relatedness is accurate now.

@OJWatson OJWatson marked this pull request as ready for review February 17, 2022 18:10
Copy link
Member

@arisp99 arisp99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! A couple of notes before merging...

R/process.R Outdated Show resolved Hide resolved
R/process.R Outdated Show resolved Hide resolved
@arisp99
Copy link
Member

arisp99 commented Feb 17, 2022

The current check_freq_method is I think too conservative. E.g. the following sample that based on WSMAF vs PLMAF is almost certainly a COI of 2, but it would fail the check_freq_method as it has too few loci (it has 6550 but the 95% is 6560) - this is an extreme example but there are others i came across where it was say 3000 loci but was clearly COI of 2 but relatedness meant we had fewer loci than expected.

On reflection re code design, I think it is better to remove the environmental variable check but rather than have coiaf return COI of 1, get it instead to execute as normal but then have the note in place. Then the end user can decide if the COI returned by coiaf should be taken on face value or should in fact be 1. As it currently works though we are making too many COI = 2 samples that have some relatedness (which the frequency method is less affected by) return as COI = 1 when if left to coiaf they would return as COI = 2.

We could try loosening the threshold for setting the COI = 1 by using a higher confidence interval instead of the 95% confidence interval. It may, as you mentioned, be better to just let the algorithm run all the way through and add a note if we did not have enough variant sites. If we use this approach, then we will also see a lot more samples for which we estimate the maximum COI. I do worry that users will not notice the note and not take the uncertainty surrounding the estimation into consideration. I think that the concern whether the users will see the note holds regardless of which strategy we employ.

With that in mind, I am starting to think that the best course of action is to return a special value which makes it clear that there is uncertainty regarding the calculation (perhaps just NA_real_ or NaN). I think we can then add attributes to this result. We can then run our estimation on the data and add a note saying the COI could be 1 or it could be the estimated value. This gives more advanced users the opportunity to choose how to handle these samples while preventing more basic users from making unintentional assumptions about the results.

Maybe one option would be to have anything that Method 1 discrete returns as COI = 1 have then Method 2 return as COI of 1 with a note, rather than the check_freq_method?

I think it is better to leave the two methods separate from one another and not have the Frequency Method call the Variant Method in its estimation.

@arisp99 arisp99 changed the title Regression not bins Estimate COI using regression Feb 24, 2022
@arisp99 arisp99 merged commit 508ea74 into main Feb 24, 2022
@arisp99 arisp99 deleted the regression_not_bins branch February 24, 2022 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants