Estimate COI using regression #17

OJWatson · 2022-02-17T16:22:46Z

I'm using this instead of email for thoughts and the PR explanation.

I think the current check_freq_method is too conservative. For example, the following sample is almost certainly a COI of 2 based on WSMAF vs PLMAF, but it would fail check_freq_method because it has too few loci (it has 6550, but the 95% threshold is 6560). This is an extreme example, but I came across others with, say, 3000 loci that were clearly COI of 2, where relatedness meant we had fewer loci than expected.

So this PR currently has an environment variable that skips the check if COIAF_CHECK_FREQ_METHOD is set to FALSE. When we do that and run with no sequence error at all, we get the following comparisons (limited to COI < 10; not many samples fall outside this range):

So GATK is actually reducing some of our signal, judging by the red lines of the regressed linear model. I think what you want to do is run with all the data (i.e. no GATK filter) and then maybe check how sequence error impacts it (no error, 0.01, and inferred).

On reflection, regarding code design, I think it is better to remove the environment variable check. Rather than have coiaf return a COI of 1, have it execute as normal but keep the note in place. Then the end user can decide whether the COI returned by coiaf should be taken at face value or should in fact be 1. As it currently works, we are making too many COI = 2 samples that have some relatedness (which the frequency method is less affected by) return as COI = 1 when, if left to coiaf, they would return as COI = 2.
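
A minimal sketch of the "execute as normal but keep the note" idea, in case it helps the discussion. The helper `flag_low_loci()` and its argument names are hypothetical and are not part of coiaf; the loci counts are the ones from the example sample above.

```r
# Hypothetical helper (not a coiaf function): keep the estimated COI as-is
# and attach a note when the sample has fewer variant loci than expected.
flag_low_loci <- function(coi_estimate, n_variant_loci, min_expected_loci) {
  too_few <- n_variant_loci < min_expected_loci
  note <- if (too_few) {
    sprintf(
      "Only %d variant loci (expected at least %d); sample may be COI = 1.",
      as.integer(n_variant_loci), as.integer(min_expected_loci)
    )
  } else {
    NA_character_
  }
  # The estimate is never overwritten; the caller decides what to do with it.
  list(coi = coi_estimate, note = note)
}

# The sample discussed above: 6550 loci against a threshold of 6560.
res <- flag_low_loci(coi_estimate = 2, n_variant_loci = 6550, min_expected_loci = 6560)
res$coi
res$note
```

With this shape the estimate of 2 survives, and the note carries the warning that the sample might really be COI = 1.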

Unless you can think of a better way of helping Method 2 work out which samples are COI = 1 versus COI = 2 with relatedness, I can't see a way around this. Maybe one option would be for Method 2 to return a COI of 1 with a note for anything that the discrete Method 1 returns as COI = 1, rather than using check_freq_method?

The regression functions are all called with use_bins = FALSE in your compute_coi and optimize_coi functions. I have set this as the default, as they seem to work similarly to the bins approach and are much quicker.
I will leave you to review this.
I have added functions for checking that the incoming data is correctly formatted, e.g. removing NAs and assigning coverage. General tip: always try to do this kind of data formatting up front, early in your big functions, rather than adding the required bits (e.g. adding coverage) later on. That way, if we later need to do other checks or add new columns, we know exactly where to go.
P.S. I'm fairly sure that, now that Method 2 is working correctly, our idea of Method 2 - Method 1 being an indicator of relatedness is accurate.
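
To illustrate the up-front formatting tip, here is a rough sketch of a single check function that runs before any estimation. It is not the code added in this PR: the name `check_input_data()`, the column names `wsmaf`, `plmaf` and `coverage`, and the default coverage of 100 are all assumptions made for the example.

```r
# Hypothetical input check (names and defaults are assumptions, not coiaf's API).
# All formatting happens here, once, before any estimation code runs.
check_input_data <- function(data, default_coverage = 100) {
  stopifnot(
    is.data.frame(data),
    all(c("wsmaf", "plmaf") %in% names(data))
  )
  # Assign a coverage column if the caller did not supply one.
  if (!"coverage" %in% names(data)) {
    data$coverage <- rep(default_coverage, nrow(data))
  }
  # Drop any locus with a missing value in the columns we rely on.
  keep <- stats::complete.cases(data[, c("wsmaf", "plmaf", "coverage")])
  data[keep, , drop = FALSE]
}

raw <- data.frame(
  wsmaf = c(0.10, NA, 0.45),
  plmaf = c(0.20, 0.30, 0.40)
)
clean <- check_input_data(raw)
nrow(clean)
```

Any later check or new column would then go into this one function rather than being scattered through the estimation code.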

arisp99

Overall LGTM! A couple of notes before merging...

R/process.R

arisp99 · 2022-02-17T20:20:23Z

> I think the current check_freq_method is too conservative. For example, the following sample is almost certainly a COI of 2 based on WSMAF vs PLMAF, but it would fail check_freq_method because it has too few loci (it has 6550, but the 95% threshold is 6560). This is an extreme example, but I came across others with, say, 3000 loci that were clearly COI of 2, where relatedness meant we had fewer loci than expected.

> On reflection, regarding code design, I think it is better to remove the environment variable check. Rather than have coiaf return a COI of 1, have it execute as normal but keep the note in place. Then the end user can decide whether the COI returned by coiaf should be taken at face value or should in fact be 1. As it currently works, we are making too many COI = 2 samples that have some relatedness (which the frequency method is less affected by) return as COI = 1 when, if left to coiaf, they would return as COI = 2.

We could try loosening the threshold for setting COI = 1 by using a higher confidence level than the current 95% confidence interval. It may, as you mentioned, be better to just let the algorithm run all the way through and add a note if we did not have enough variant sites. If we use this approach, we will also see a lot more samples for which we estimate the maximum COI. I do worry that users will not notice the note and will not take the uncertainty surrounding the estimate into consideration. I think this concern about whether users will see the note holds regardless of which strategy we employ.

With that in mind, I am starting to think that the best course of action is to return a special value that makes it clear there is uncertainty regarding the calculation (perhaps just NA_real_ or NaN). We can then add attributes to this result: we run our estimation on the data and add a note saying the COI could be 1 or it could be the estimated value. This gives more advanced users the opportunity to choose how to handle these samples while preventing more basic users from making unintentional assumptions about the results.
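
Roughly what I have in mind, as a sketch only. The constructor `uncertain_coi()` and the attribute names `estimated_coi` and `note` are placeholders I made up for illustration, not an existing coiaf interface.

```r
# Hypothetical constructor: the visible value is NA_real_, while the
# regression estimate and an explanatory note travel along as attributes.
uncertain_coi <- function(estimated_coi, note) {
  structure(NA_real_, estimated_coi = estimated_coi, note = note)
}

res <- uncertain_coi(
  estimated_coi = 2,
  note = "Too few variant sites: COI could be 1 or the estimated value."
)

# Basic users see a missing value and cannot silently treat it as a COI.
is.na(res)

# Advanced users can recover the estimate and decide how to handle it.
attr(res, "estimated_coi")
attr(res, "note")
```

One caveat with this design: base R drops such attributes when values are combined with `c()`, so if results for many samples are collected into one vector, the estimate and note would need to be stored alongside it (for example as extra columns) rather than as attributes.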

> Maybe one option would be for Method 2 to return a COI of 1 with a note for anything that the discrete Method 1 returns as COI = 1, rather than using check_freq_method?

I think it is better to leave the two methods separate from one another and not have the Frequency Method call the Variant Method in its estimation.

OJWatson added 10 commits February 15, 2022 20:54

start of changes for regression fitting

0765cb9

Merge branch 'main' into regression_not_bins

d626982

set bucket size of 1 now to catch for weighting

3fdf874

second commit for regression work - still getting tests to work

ec57ddb

rest of commit

ac78891

Merge branch 'main' into regression_not_bins

cc2584f

few bugs for beta na removal

bee8413

changes for weighted colsums

de61e04

typo in coverage and stats:: for weighted.mean

0c0ae81

and the pkgdown

55de58d

OJWatson marked this pull request as ready for review February 17, 2022 18:10

OJWatson requested a review from arisp99 February 17, 2022 18:10

arisp99 reviewed Feb 17, 2022

OJWatson and others added 5 commits February 24, 2022 15:17

filter should be on m_variant not wsmaf (doesn't impact code but better clarity)

ee86cba

Merge origin/main into regression_not_bins

fe8586d

Combine check functions

6ff1567

Add check inputs file and create tests

7231cf4

Silence R CMD check notes

d35e685

arisp99 mentioned this pull request Feb 24, 2022

Frequency Method COI = 1 threshold #21

Closed

Don't use an environmental variable

dc22cbb

arisp99 changed the title ~~Regression not bins~~ Estimate COI using regression Feb 24, 2022

arisp99 merged commit 508ea74 into main Feb 24, 2022

arisp99 deleted the regression_not_bins branch February 24, 2022 18:03
