Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run CNVkit pipeline (v0.9.7) in wgs mode using separate commands to generate reference files #525

Open
golubnikova opened this issue Jun 8, 2020 · 6 comments
Labels

Comments

@golubnikova
Copy link

Hi, @etal !

I want to run CNVkit pipeline (v0.9.7) in wgs mode using separate commands to generate reference files (*.access.bed, *.reference.cnn), to use --reference option in cnvkit.py batch command.

According to your manual and batch.py,
*target.bed file in wgs mode is identical to *access.bed.

The batch --method wgs option uses the given reference genome’s sequencing-accessible regions (“access” BED) as the “targets” – these will be calculated on the fly if not provided.

But when I run cnvkit.py batch command with --normal option, I see from the logfile that *.target.bed file is generated as well. And as a result I get only *.target.bed file from CNVkit pipeline.

Logfile of cnvkit.py batch command

Wrote GRCh38.d1.vd1.bed with 438 regions
Estimated read length 151.0
WGS average depth 141.50 --> using bin size 353
Detected file format: bed
Splitting large targets
Wrote /output/test_test3/GRCh38.d1.vd1.target.bed with 8288301 regions
Wrote /output/test_test3/GRCh38.d1.vd1.antitarget.bed with 0 regions
Building a copy number reference from normal samples...
Skip processing D2288_n.aligned.bam with empty regions file /output/test_test3/GRCh38.d1.vd1.antitarget.bed
Processing reads in D2288_n.aligned.bam
Wrote /output/test_test3/D2288_n.aligned.antitargetcoverage.cnn with 0 regions
Time: 2167.461 seconds (1086396 reads/sec, 3824 bins/sec)
Summary: #bins=8288301, #reads=2354720755, mean=284.1017, min=0.0, max=148433.6953642384 
Percent reads in regions: 84.309 (of 2792973206 mapped)
Wrote /output/test_test3/D2288_n.aligned.targetcoverage.cnn with 8288301 regions
Relative log2 coverage of chrX=-1.03, chrY=-0.984 (maleness=204 x 1.28 = 260) --> assuming male
Loading /output/test_test3/D2288_n.aligned.targetcoverage.cnn

My running command:
$CNVKIT batch /data/D2288/D2288_t.aligned.bam --normal /data/D2288/D2288_n.aligned.bam --seq-method wgs -p 96 --output-dir /output/test_test3 --output-reference /output/test_test3/ref/reference.cnn --fasta /ref/GRCh38.d1.vd1/GRCh38.d1.vd1.fa |& tee -a /output/test_test3/D2288.batch.log

Does it mean that in wgs mode I should run cnvkit.py target command before generating a *.reference.cnn file as well?
If yes, whith what options should I run cnvkit.py target if I have wgs input data.

Currently, I'm using the following commands to run CNVkit pipeline with separate reference generation and expect to see the equvivalent results (as from the command cnvkit.py batch with --normal option) :

  1. $CNVKIT access $refGenomeFile -o $accessFile
  2. $CNVKIT coverage $bam1 $accessFile -p $PROGRAMNUMCPUS -o $targetCoveraga
  3. $CNVKIT reference $targetCoveraga -f $refGenomeFile -o $referenceFromNormal --no-edge
  4. $CNVKIT batch $bam2 --seq-method $seqMethod --segment-method $segmentation --processes $PROGRAMNUMCPUS --output-dir $runDir --reference $referenceFromNormal

The described above commands are to make tumor analysis, but in germline analysis I have the same issue. If it's necessary - I can show its logfile as well.

Appreciate any recommandation concerning my issue.

With best regards,
Liliya

@etal
Copy link
Owner

etal commented Jun 9, 2020

Is that the complete log? Did batch produce files /output/test_test3/reference.cnn and /output/test_test3/D2288_t.aligned.cnr, or only the output file /output/test_test3/D2288_n.aligned.targetcoverage.cnn?

@golubnikova
Copy link
Author

Attaching a full log:
D2288.batch.log

batch generates the following files:

reference.cnn
D2288_n.aligned.antitargetcoverage.cnn
D2288_n.aligned.targetcoverage.cnn
D2288_t.aligned.antitargetcoverage.cnn
D2288_t.aligned.cnr
D2288_t.aligned.targetcoverage.cnn
GRCh38.d1.vd1.antitarget.bed
GRCh38.d1.vd1.target.bed

At the step Segmenting with method 'cbs', significance threshold 1e-06, in 96 processes script fails with RunTimewarning I issued here

@etal
Copy link
Owner

etal commented Jun 9, 2020

Could you try segmenting the .cnr file directly with a single process? The command is: cnvkit.py segment D2288_t.aligned.cnr -m cbs -p 1

Usually when .cns files do not appear and the logs don't explain why, it's because the R subprocess used for CBS failed. With parallel processing the subprocess logs get lost but with a single process the underlying error should be visible.

The RuntimeWarning you see is probably due to some edge effect in smoothing that I didn't handle properly. It wouldn't be a reason for failure, e.g. in #521 you saw the same warnings but segmentation succeeded (to some extent).

For germline samples, you can also consider bintest to find smaller alterations. That sub-command does not use R and should run pretty fast.

@golubnikova
Copy link
Author

I've checked. It works without error.

Command: $CNVKIT segment /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.cnr -m cbs -p 1 -o /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.cns

Logfile of segment command:

Dropped 64 rows with missing log2 values
Segmenting with method 'cbs', significance threshold 0.0001, in 1 processes
Wrote D2288_t.aligned.cns with 35 regions

I have also run the following commands after segment:
$CNVKIT segmetrics -s /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.cns --ci --pi -o /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.segmetrics.cns D2288_t.aligned.cnr

$CNVKIT call /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.segmetrics.cns --filter ci -o /output/v097_somatic/test_time/cnvkitRunDir/D2288_t.aligned.call.cns --drop-low-coverage

All the output files are attached.
D2288_t.aligned.zip
What else could I check for debugging?

@golubnikova
Copy link
Author

For germline samples, you can also consider bintest to find smaller alterations.

Do you mean *.bintest.cns file?
Should it be generated by cnvkit.py batch or by another command?

@golubnikova
Copy link
Author

It seems that after the update of
pysam==0.15.4 -> pysam==0.16.0
numpy==1.18.4 -> numpy==1.18.5

I got a *.bintest.cns file from germline mode of CNVkit's pipeline (batch).
I want to export it in VCF format. How to do it? Shell I make a pre-processing of *.bintest.cns file before export?

When I run export on a *.bintest.cns file from CNVkit's pipeline (batch)
Command:
$CNVKIT export vcf /output/v097_germ/new_output_try2/cnvkitRunDir/ESSE_03-1552_K1_S24.bintest.cns

I have the following log with error:

Treating sample ESSE_03-1552_K1_S24.bintest as male
Traceback (most recent call last):
  File "/usr/local/bin/cnvkit.py", line 9, in <module>
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/cnvlib/commands.py", line 1693, in _cmd_export_vcf
    is_sample_female, args.sample_id, cnarr)
  File "/usr/local/lib/python3.7/dist-packages/cnvlib/export.py", line 238, in export_vcf
    table = pd.DataFrame.from_records(vcf_rows, columns=vcf_columns)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 1596, in from_records
    first_row = next(data)
  File "/usr/local/lib/python3.7/dist-packages/cnvlib/export.py", line 266, in segments2vcf
    out_dframe = segments.data.loc[:, ["chromosome", "end", "log2", "probes"]]
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1289, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1954, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1595, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1553, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py", line 1655, in _validate_read_indexer
    "Passing list-likes to .loc or [] with any missing labels "
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants