Skip to content

dnil/strdrop

Repository files navigation

strdrop

STRdrop logo - a shooting star with a STRawberry

PyPI Version GitHub Release Date GitHub commits latest GitHub commit rate GitHub issues-closed Average time to resolve an issue Percentage of issues still open Ruff uv Woke

Flag STR coverage drops in LRS data

A simple tool to leverage a collection of reference samples to calculate normal sequencing depths over loci in a locus catalog, and flagging sites with drops in new, N+1, samples.

Call coverage drop if alleles for a repeat in the test file

  • are identical, or fairly similar, defined as within EDIT_RATIO_CUTOFF (default 0.9) Levenshtein edit distance ratio
  • are covered (have a total sequencing depth SD) at a fraction of CASE_COVERAGE_RATIO_CUTOFF (default 0.55) or below of case average locus coverage
  • are among the lowest alpha/N_loci (default 0.05/N_loci) of total sequencing depth values for that locus compared to the test set

Coverage drop calls are marked in the output VCF with FILTER tag LowDepth and INFO tag STRDROP. Edit ratio STRDROP_EDR, case coverage ratio STRDROP_SDR and coverage marginal P value STRDROP_P are output as INFO keys on the resulting VCF.

Usage

 Usage: strdrop [OPTIONS] COMMAND [ARGS]...

 Call coverage drops over alleles in STR VCFs

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --version  -v        Show version and exit.                                                                                               │
│ --help               Show this message and exit.                                                                                          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ call    STRdrop: Detect drops in STR coverage 🧬                                                                                          │
│ build   STRdrop: Build reference json from sequencing coverage in STR VCFs                                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯



 Usage: strdrop call [OPTIONS] INPUT_FILE OUTPUT_FILE

 STRdrop: Detect drops in STR coverage 🧬

╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_file       PATH  Input STR call VCF file [required]                                                                            │
│ *    output_file      PATH  Output annotated VCF file [required]                                                                          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --training-set        PATH       Training VCF directory or json with reference data [required]                                         │
│    --xy                  SAMPLE ID  Sample in VCF to treat as karyotype XY. May be specified multiple times.                              │
│    --alpha               FLOAT      Unadjusted probability confidence level for coverage test [default: 0.05]                             │
│    --fraction            FLOAT      Case average adjusted sequencing depth ratio cutoff [default: 0.55]                                   │
│    --edit                FLOAT      Allele similarity Levenshtein edit distance ratio cutoff [default: 0.9]                               │
│    --help                           Show this message and exit.                                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯



 Usage: strdrop build [OPTIONS] REFERENCE_FILE

 STRdrop: Build reference json from sequencing coverage in STR VCFs

╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    reference_file      PATH  Output reference archive [required]                                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --training-set        PATH  Input directory with reference data [required]                                                             │
│    --help                      Show this message and exit.                                                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

You can test the functionality with one of the test files with relatively low coverage:

uv run strdrop call --fraction 0.55 --training-set test-data test-data/GV-42.trgt.vcf GV-42.strdrop.vcf

To avoid spurious calls on X for XY karyotype, pass the --xy <sample_name> option to lower the coverage ratio expectation with a 0.5 shift.

uv run strdrop call --xy GV-42 --fraction 0.55 --training-set test-data test-data/GV-42.trgt.vcf GV-42.strdrop.vcf

Note the tags in the output VCF.

➜  strdrop git:(main) ✗ grep HTT GV-42.strdrop.vcf
chr4	3074876	.	CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCT	CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCT,CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCT	.	LowDepth	TRID=HD_HTT;END=3074969;MOTIFS=CAG,CCG;STRUC=<TR>;STRDROP_P=0;STRDROP_EDR=0.967033;STRDROP_SDR=0.470972;STRDROP	GT:AL:ALLR:SD:MC:MS:AP:AM	1/2:84,87:83-84,87-93:25,32:17_8,18_8:0(0-48)_0(51-54)_1(54-57)_1(60-81),0(0-51)_0(54-57)_1(57-60)_1(63-84):0.964286,0.965517:0.02,0.01```
➜  strdrop git:(main) ✗ grep Strdrop GV-42.strdrop.vcf |head -9
##INFO=<ID=STRDROP_P,Number=1,Type=Float,Description="Strdrop coverage sequencing depth level probability">
##INFO=<ID=STRDROP_EDR,Number=1,Type=Float,Description="Strdrop allele similarity Levenshtein edit distance ratio">
##INFO=<ID=STRDROP_SDR,Number=1,Type=Float,Description="Strdrop case average adjusted sequencing depth ratio">
##INFO=<ID=STRDROP,Number=0,Type=Flag,Description="Strdrop coverage drop detected">
##FILTER=<ID=LowDepth,Description="Strdrop coverage drop detected">
##FORMAT=<ID=SDP,Number=1,Type=Float,Description="Strdrop coverage sequencing depth level probability">
##FORMAT=<ID=EDR,Number=1,Type=Float,Description="Strdrop allele similarity Levenshtein edit distance ratio">
##FORMAT=<ID=SDR,Number=1,Type=Float,Description="Strdrop case average adjusted sequencing depth ratio">
##FORMAT=<ID=DROP,Number=1,Type=String,Description="Strdrop coverage drop detected, 1 for LowDepth">

Output files can be written as compressed VCF or BCF by simply giving an appropriate outfile name suffix, thanks to CyVCF2.

strdrop call --training-set /home/daniel.nilsson/proj/strdrop/reference/ mycase.trgt.vcf.gz mycase.trgt.strdrop.vcf.gz

About

Flag STR coverage drops in LRS data

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages