
QDNAseq.hg38: Bin annotation for hg38? #59

Open
nikleotide opened this issue Jan 25, 2019 · 33 comments
@nikleotide

Hi.
I am trying to use QDNAseq for hg38 and thus need to use hg38 rds file. Any suggestion how to create the hg38 file and how to use it?

Thank you very much in advance.
H.

@AaronHechmer

Ditto above-- would like to use QDNAseq with hg38 aligned data. I'd be happy to build references if somebody wants to provide comment or recipe on what's required to do so.

@HenrikBengtsson
Collaborator

Regarding creating hg38 bin annotations:

Section 'Generating bin annotations' of the 'Introduction to QDNAseq' vignette describes how to build the bin annotation data.

Some quick comments: it looks pretty straightforward once one has identified exactly which genome reference files to download. It appears that only the last step, the 'median residuals of the LOESS fit', requires some serious modeling (done on a set of 1000 Genomes Project reference samples).

It would be awesome if someone could:

  1. Reproduce the existing instructions for hg19 and validate that we can get numerically reproducible bin annotations matching those in the QDNAseq.hg19 package.
  2. After confirming that hg19 can be reproduced correctly, it should be "easier" to attack hg38.

BTW, @lbeltrame mentions in #57 (comment) they have built these based on "1000G fastq files aligned against hg38".
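For context, a minimal sketch of how the finished bin annotations get consumed downstream; the BAM file name is a placeholder:

```r
library(QDNAseq)

# Fetch precomputed bin annotations (hg19 at the time of writing; an hg38
# equivalent is what this issue asks for) and count reads per bin.
bins <- getBinAnnotations(binSize = 15)
readCounts <- binReadCounts(bins, bamfiles = "tumor.bam")  # placeholder BAM
```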

@HenrikBengtsson HenrikBengtsson changed the title rds file for hg38 QDNAseq.hg38: Bin annotation for hg38? Sep 17, 2019
@ameynert

  1. Reproduce the existing instructions for hg19 and validate that we can get numerically reproducible bin annotations matching those in the QDNAseq.hg19 package.
  2. After confirming that hg19 can be reproduced correctly, it should be "easier" to attack hg38.

I need PE150 hg38 annotations for QDNAseq, so I've had a go at reproducing the hg19 SR50. It's not quite the same, I think because my version of bwa is slightly different, but the correlation in the residuals is very good (Pearson's r = 0.994). Everything else is identical. I'll report back on the hg38 attempt; I may not be able to share it, as I'm basing the LOESS residuals on a set of downsampled WGS normals from an internal project, to keep them as closely matched as possible to my samples.

@HenrikBengtsson

Thank you. I think it would also be useful if you shared how you re-generated the hg19 SR50 annotations. Knowing exactly which URLs you used to download the reference annotation data will help others who might attempt the same thing. Feel free to cut'n'paste your commands/scripts here. It will also be valuable because it might help us re-identify exactly which source data was used for the QDNAseq.hg19 data, cf. #80 (comment).

@ameynert

  1. Download 1000 Genomes files for the baseline, following the instructions at http://bioconductor.org/packages/release/bioc/vignettes/QDNAseq/inst/doc/QDNAseq.pdf with a modified top-level URL:
urlroot <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/historical_data/former_toplevel"
g1k <- read.table(file.path(urlroot, "sequence.index"), header=TRUE, sep="\t", as.is=TRUE, fill=TRUE)
g1k <- g1k[g1k$INSTRUMENT_PLATFORM == "ILLUMINA", ]
g1k <- g1k[g1k$ANALYSIS_GROUP == "low coverage", ]
g1k <- g1k[g1k$LIBRARY_LAYOUT == "SINGLE", ]
g1k <- g1k[g1k$WITHDRAWN == 0, ]
g1k <- g1k[!g1k$BASE_COUNT %in% c("not available", ""), ]
g1k$BASE_COUNT <- as.numeric(g1k$BASE_COUNT)
g1k$READ_COUNT <- as.integer(g1k$READ_COUNT)
g1k$readLength <- g1k$BASE_COUNT / g1k$READ_COUNT
g1k <- g1k[g1k$readLength > 50, ]
readCountPerSample <- aggregate(g1k$READ_COUNT, by=list(sample=g1k$SAMPLE_NAME), FUN=sum)
g1k <- g1k[g1k$SAMPLE_NAME %in% readCountPerSample$sample[readCountPerSample$x >= 1e6], ]
g1k$fileName <- basename(g1k$FASTQ_FILE)

seqroot <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/"
for (i in rownames(g1k)) {
  sourceFile <- file.path(seqroot, g1k[i, "FASTQ_FILE"])
  destFile <- g1k[i, "fileName"]
  if (!file.exists(destFile))
    download.file(sourceFile, destFile, mode="wb")
}
x <- subset(g1k, select=c(SAMPLE_NAME, fileName))
write.table(x, "sample_file_map.txt", col.names=FALSE, quote=FALSE, row.names=FALSE, sep="\t")
  2. Trim reads to 50 bp with TrimGalore, align with bwa aln and samse, and merge runs for the same sample into a single BAM file. Confirmed 38 samples, which matches the number mentioned in the QDNAseq instructions linked above.

Script submit_process_1kg_hg19_samples.sh

#!/bin/bash

# Requires: trim_galore, samtools, bwa (0.7.16 used, QDNAseq instructions used 0.5.9)
#$ -N qdnaseq_process_samples
#$ -j y
#$ -S /bin/bash
#$ -cwd
#$ -l h_vmem=8G
#$ -l h_rt=24:00:00

unset MODULEPATH
. /etc/profile.d/modules.sh

IDS=$1
MAP=$2

SAMPLE=`head -n $SGE_TASK_ID $IDS | tail -n 1`

# bcbio bwa-indexed hg19 reference
HG19_REF=/path/to/hg19.fa

for FASTQ in `grep ^$SAMPLE $MAP | cut -f 2`
do
  trim_galore --hardtrim5 50 $FASTQ
  BFASTQ=${FASTQ%.fastq.gz}
  COMBINED=${SAMPLE}_${BFASTQ}
  bwa aln -n 2 -q 40 $HG19_REF $BFASTQ.50bp_5prime.fq.gz > $COMBINED.sai
  bwa samse $HG19_REF $COMBINED.sai $BFASTQ.50bp_5prime.fq.gz | samtools view -hb | samtools sort - > ${COMBINED}.bam
  samtools index $COMBINED.bam
done

samtools merge $SAMPLE.bam ${SAMPLE}_*.bam
samtools index $SAMPLE.bam

Generate sample list from sample/file map and run processing steps.

cut -f 1 sample_file_map.txt | sort -u > samples.txt
qsub -t 1-38 submit_process_1kg_hg19_samples.sh samples.txt sample_file_map.txt
  3. Download resource files:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign50mer.bigWig
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityRegionsExcludable.bed.gz
gunzip *.gz
  4. Create bins, add mappability and blacklist:
library(QDNAseq)
library(Biobase)
library(BSgenome.Hsapiens.UCSC.hg19)

binSize <- 15
bins <- createBins(bsgenome = BSgenome.Hsapiens.UCSC.hg19, binSize = binSize)
bins$mappability <- calculateMappability(bins, bigWigFile = "wgEncodeCrgMapabilityAlign50mer.bigWig", bigWigAverageOverBed = "/path/to/bigWigAverageOverBed")
bins$blacklist <- calculateBlacklist(bins, bedFiles = c("wgEncodeDacMapabilityConsensusExcludable.bed", "wgEncodeDukeMapabilityRegionsExcludable.bed"))
saveRDS(bins, file = "hg19_bins_before_loess.rds")
  5. Calculate residuals:
samples <- read.table("1000G/samples.txt", header=FALSE, stringsAsFactors=FALSE)
bams <- paste0("1000G/", samples$V1, ".bam")
ctrl <- binReadCounts(bins, bamfiles = bams)
ctrl <- applyFilters(ctrl, residual=FALSE, blacklist=FALSE, mappability=FALSE, bases=FALSE)
bins$residual <- iterateResiduals(ctrl)
bins$use <- bins$chromosome %in% as.character(1:22) & bins$bases > 0
saveRDS(bins, file = "hg19_bins.rds")
  6. Compare to the pre-calculated SR50 annotations:
library(Biobase)
bins_precalc <- getBinAnnotations(binSize=15)
sum((pData(bins_precalc)$bases - bins$bases))
[1] 0
sum((pData(bins_precalc)$gc - bins$gc), na.rm=TRUE)
[1] 0
sum((pData(bins_precalc)$mappability - bins$mappability))
[1] 0
sum((pData(bins_precalc)$blacklist - bins$blacklist))
[1] 0
cor.test(pData(bins_precalc)$residual, bins$residual)

        Pearson's product-moment correlation

data:  pData(bins_precalc)$residual and bins$residual
t = 3877.7, df = 167510, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9944231 0.9945286
sample estimates:
      cor 
0.9944761 

sum(pData(bins_precalc)$use)
[1] 179187
sum(bins$use)
[1] 179187

@ameynert

38 1000G samples, 241 single-end FASTQ files in total, reproducible by following step 1 above.

HG01101
HG01204
HG01495
HG01522
NA06994
NA07051
NA11830
NA11832
NA11918
NA11919
NA11992
NA11994
NA12005
NA18489
NA18498
NA18504
NA18623
NA18624
NA18632
NA18633
NA18636
NA18867
NA18912
NA18960
NA18982
NA18984
NA18986
NA19058
NA19063
NA19064
NA19066
NA19116
NA19138
NA19474
NA19703
NA19707
NA19789
NA19901

@HenrikBengtsson

Thank you very much for this

@ameynert

The hg38 PE150 bin generation worked fine following the same recipe as for hg19, but with my own internal set of control BAMs downsampled to 3.5X. I've attached plots comparing the stats: mappability, of course, is going to be quite different, but the G+C and residuals should, and do, look similar. There is also a similar proportion of usable bins. I didn't compare the blacklisting; I'm happy to trust the input resource files on that.

sum(bins19$use) / length(bins19$start)
[1] 0.8681919
sum(bins38$use) / length(bins38$start)
[1] 0.8931605

hg19_SR50_vs_hg38_PE150_qqplot_mappability.pdf
hg19_vs_hg38_qqplot_GC.pdf
hg19_vs_hg38_qqplot_residual.pdf

@leachim

leachim commented Nov 1, 2019

Are you planning to provide the annotations for hg38 in a package similar to QDNAseq.hg19?

@Jane-Gibson

QDNAseq.hg38 would be really helpful, please!

@ameynert

ameynert commented Nov 4, 2019

I have hg38 PE150 5 kbp and 10 kbp bins generated. I'll have to check with the PI about the normal samples used for the residual-generation step, as they are from a cancer WGS study, but yes, technically I could share these. It wasn't at all difficult to generate the bins, though; it just took time and the right set of normal samples.

@leachim

leachim commented Dec 3, 2019

Any news on this? Will you be able to share the hg38 bin annotations?

@ameynert

ameynert commented Jan 6, 2020 via email

@mmterpstra

mmterpstra commented May 20, 2020

Regarding the example shown in #59 (comment):

Step 4 (generation of bins) needs R 3.4.4 at most, because the packages BSgenome.Hsapiens.UCSC.hg19 / BSgenome.Hsapiens.UCSC.hg38 were only released for Bioconductor 3.6 / R 3.4.4 or so.

My resources (hg38, poorly tested, so they may fail):
https://gist.github.com/mmterpstra/44f55b589852708fa1bdc0cb09bde517
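For what it's worth, on a current R/Bioconductor setup the genome packages install directly via BiocManager, so the old R 3.4.4 constraint may no longer apply; a sketch:

```r
# Install the hg38 BSgenome package on a current R/Bioconductor installation
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
```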

@asntech

asntech commented Sep 9, 2020

Is there any chance of getting the bin annotations for the hg38 reference genome? I'm following @ameynert's steps in #59 (comment), but it is taking time.

@asntech

asntech commented Oct 22, 2020

I've created a package, QDNAseq.hg38, for the hg38 build using the steps mentioned above, the QDNAseq vignette, and #80.
It has bin sizes of 5, 10, 15, 30, 50, 100, 500 and 1000 kb. The 1 kb bins took longer, but I'm happy to generate them if anyone needs them.

Install it from here: https://github.com/asntech/QDNAseq.hg38
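A quick install-and-use sketch (assuming the remotes package is available; getBinAnnotations accepts a genome argument):

```r
# Install the annotation package from GitHub, then load hg38 bins through QDNAseq
remotes::install_github("asntech/QDNAseq.hg38")
library(QDNAseq.hg38)
library(QDNAseq)
bins <- getBinAnnotations(binSize = 15, genome = "hg38")
```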

@AaronHechmer

AaronHechmer commented Oct 22, 2020 via email

asntech added a commit to asntech/QDNAseq.hg38 that referenced this issue Oct 24, 2020
@sxu922

sxu922 commented Mar 1, 2021

I've created a package, QDNAseq.hg38, for the hg38 build using the steps mentioned above, the QDNAseq vignette, and #80.
It has bin sizes of 5, 10, 15, 30, 50, 100, 500 and 1000 kb. The 1 kb bins took longer, but I'm happy to generate them if anyone needs them.

Install it from here: https://github.com/asntech/QDNAseq.hg38

Hi, I'm trying to create a 200 kb bin size. I'm following the steps, but how do you get from saveRDS(bins, file = "hg19_bins.rds") to hg38.200kbp.SR50.rda? Also, if I want to add 200 kb to your existing package, do I just add the rda file to the data folder? I'm really new to R, so please forgive me for asking a silly question.

@asntech

asntech commented Mar 2, 2021

@sxu922 I still have the processed BAMs and other annotations, and I'm happy to create the 200 kb bins and add them to the package; that way I can make sure all the bins are computationally reproducible.
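For anyone packaging their own bins, one way to go from the saved RDS to a package-style .rda is roughly the following; the object name hg38.200kbp.SR50 is an assumption based on the existing naming scheme, where the data object's name matches the file name:

```r
# Load the bins saved earlier and re-save them as a named .rda for the data/ folder.
# The object name hg38.200kbp.SR50 is assumed from the package's naming convention.
bins <- readRDS("hg38_bins.rds")
hg38.200kbp.SR50 <- bins
save(hg38.200kbp.SR50, file = "hg38.200kbp.SR50.rda", compress = "xz")
```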

@sxu922

sxu922 commented Mar 2, 2021

@sxu922 I still have the processed BAMs and other annotations, and I'm happy to create the 200 kb bins and add them to the package; that way I can make sure all the bins are computationally reproducible.

Wow, thank you so much for the reply! I didn't expect a response so fast! I created the 200 kb bin and figured out that I can just read the rds file and use that as the bin annotation, so there's no need to create 200 kb for me. Still, thank you so much!

@francesca-guana

I've created a package, QDNAseq.hg38, for the hg38 build using the steps mentioned above, the QDNAseq vignette, and #80. It has bin sizes of 5, 10, 15, 30, 50, 100, 500 and 1000 kb. The 1 kb bins took longer, but I'm happy to generate them if anyone needs them.

Install it from here: https://github.com/asntech/QDNAseq.hg38

Hi Aziz,
I've installed the QDNAseq.hg38 package. I have paired-end data with a read length of 150 bp. Can I use the package even though there is no matching experiment type (I saw only SR50 and PE100)?

Thank you

@aliahhawari

I've created a package, QDNAseq.hg38, for the hg38 build using the steps mentioned above, the QDNAseq vignette, and #80. It has bin sizes of 5, 10, 15, 30, 50, 100, 500 and 1000 kb. The 1 kb bins took longer, but I'm happy to generate them if anyone needs them.
Install it from here: https://github.com/asntech/QDNAseq.hg38

Hi Aziz, I've installed the QDNAseq.hg38 package. I have paired-end data with a read length of 150 bp. Can I use the package even though there is no matching experiment type (I saw only SR50 and PE100)?

Thank you

I have this query too. Did you manage to run QDNAseq.hg38 on your 150 bp reads?

@RodrigoGM

RodrigoGM commented Oct 24, 2023

Hi, I'm analyzing data aligned to GRCh38 using QDNAseq.hg38. The system runs with no errors; however, when I run callBins it pulls centromeres from b37 (line 87 below), and I hoped to get some clarification on this.

How does this impact the re-segmentation and calls? Which function pulls the centromere coordinates? Is there a way to change the assembly for the centromere positions to GRCh38 or T2T, e.g. by passing an argument or by automatically retrieving it from the featureData?

Please note that the featureData slot in both the segmented-only object (line 77) and the copy-number-called object (line 3747) suggests hg38 was used.

Thank you, any help is much appreciated,

77   > lpsB.segmented.copynumbers@featureData
78   QDNAseq bin annotations for Hsapiens, build hg38.
79   Created by Aziz Khan with QDNAseq 1.22.0, 2020-09-11 05:45:57.
80   An object of class 'AnnotatedDataFrame'
81     featureNames: 1:1-100000 1:100001-200000 ... Y:57200001-57227415 (30894 total)
82     varLabels: chromosome start ... use (9 total)
83     varMetadata: labelDescription
84
85  > lpsB.copynumber.calls <- lpsB.segmented.copynumbers %>%                                                                                                                                                                                                                                                                 
86  +     callBins()                                                                                                                                                                                                                                                                                                          
87  + Dividing chromosomes into arms using centromere positions from GRCh37
88 
89  [1] 1
90  [1] 2
...
113 EM algorithm started ...
114 
115 [1] "Total number of segments present in the data: 34434"
116 [1] "Number of segments used for fitting the model: 3046"

...

3747 > lpsB.copynumber.calls@featureData
     QDNAseq bin annotations for Hsapiens, build hg38.
     Created by Aziz Khan with QDNAseq 1.22.0, 2020-09-11 05:45:57.
     An object of class 'AnnotatedDataFrame'
       featureNames: 1:1-100000 1:100001-200000 ... Y:57200001-57227415 (30894 total)
       varLabels: chromosome start ... use (9 total)
       varMetadata: labelDescription

@anagrei

anagrei commented Oct 26, 2023

I've created a package, QDNAseq.hg38, for the hg38 build using the steps mentioned above, the QDNAseq vignette, and #80. It has bin sizes of 5, 10, 15, 30, 50, 100, 500 and 1000 kb. The 1 kb bins took longer, but I'm happy to generate them if anyone needs them.
Install it from here: https://github.com/asntech/QDNAseq.hg38

Hi Aziz, I've installed the QDNAseq.hg38 package. I have paired-end data with a read length of 150 bp. Can I use the package even though there is no matching experiment type (I saw only SR50 and PE100)?
Thank you

I have this query too. Did you manage to run QDNAseq.hg38 on your 150 bp reads?

It seems there isn't really a PE100 experiment type in the package. When I tried to use it, I got the following error:
" QDNAseq bin annotation ‘hg38.1000kbp.PE100’ was not found in package ‘QDNAseq.hg38’. Available bin annotations are: ‘hg38.1000kbp.SR50’, ‘hg38.100kbp.SR50’, ‘hg38.10kbp.SR50’, ‘hg38.15kbp.SR50’, ‘hg38.1kbp.SR50’, ‘hg38.30kbp.SR50’, ‘hg38.500kbp.SR50’, ‘hg38.50kbp.SR50’, ‘hg38.5kbp.SR50’ ".

A QDNAseq bin annotation for 100 or 150 bp paired-end reads would be very much appreciated, if available.
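In the meantime, you can check which annotations an installed copy of the package actually ships; a small base-R sketch:

```r
# List the data objects shipped with the installed package; names such as
# "hg38.50kbp.SR50" encode the bin size and the experiment type.
available <- data(package = "QDNAseq.hg38")$results[, "Item"]
print(available)
```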

@kandabarau

  1. Reproduce the existing instructions for hg19 and validate that we can get numerically reproducible bin annotations matching those in the QDNAseq.hg19 package.
  2. After confirming that hg19 can be reproduced correctly, it should be "easier" to attack hg38.

I need PE150 hg38 annotations for QDNAseq, so I've had a go at reproducing the hg19 SR50. It's not quite the same, I think because my version of bwa is slightly different, but the correlation in the residuals is very good (Pearson's r = 0.994). Everything else is identical. I'll report back on the hg38 attempt; I may not be able to share it, as I'm basing the LOESS residuals on a set of downsampled WGS normals from an internal project, to keep them as closely matched as possible to my samples.

Dear Alison @ameynert,
Are you able to share your hg38 PE150 bin annotations? I am currently looking into how to analyze our PE150 5x data, and the 1000 Genomes data does not seem to be a proper reference, as its reads are shorter.

Bests,
Sergey

@ameynert

ameynert commented Mar 6, 2024

Hi @kandabarau, unfortunately the project that the samples came from did not allow sharing even the derived data.

@mmterpstra

asntech/QDNAseq.hg38#2 is the same issue in a different flavour. Aligning and analyzing the data as single-end, and then adapting the workflow in the asntech/QDNAseq.hg38 repo to SE150, might work.
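If trimming is the route taken, trim_galore --hardtrim5 50 from the hg19 recipe above is the robust option; a dependency-free sketch of the same hard-trim on an uncompressed FASTQ:

```shell
# Hard-trim each FASTQ record to its first 50 bp.
# In a 4-line FASTQ record the sequence and quality strings are lines 2 and 4,
# i.e. the even-numbered lines of the file.
awk 'NR % 2 == 0 { print substr($0, 1, 50); next } { print }' reads.fastq > reads.50bp.fastq
```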

@kandabarau

kandabarau commented Mar 6, 2024

asntech/QDNAseq.hg38#2 is the same issue in a different flavour. Aligning and analyzing the data as single-end, and then adapting the workflow in the asntech/QDNAseq.hg38 repo to SE150, might work.

I was thinking about this. Unfortunately https://github.com/asntech/QDNAseq.hg38 has no SE150, only SE50 (the 38 1kg samples I could find have PE100 at most). I'm not convinced that cropping my reads to SE50 wouldn't harm performance. Would it?

Anyway thank you!

Bests,
Sergey

@mmterpstra

Some regions will likely drop off due to lack of mappability (see the hg19_SR50_vs_hg38_PE150_qqplot_mappability.pdf posted above). Note that the control residuals are probably used as a weight in the normalization of the test dataset. This is why you will likely prefer your own control samples, or a similarly prepped dataset, to improve the fit (I'm not an expert on how this works).
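Concretely, the precomputed residual column feeds into the bin-filtering step; a sketch of where it is used (the BAM name is a placeholder):

```r
# The 'residual' column of the bin annotation drives bin exclusion:
# applyFilters(residual = TRUE) drops bins whose control residuals are
# missing or extreme, alongside the blacklist filter.
library(QDNAseq)
bins <- getBinAnnotations(binSize = 50)                     # ships the control residuals
readCounts <- binReadCounts(bins, bamfiles = "sample.bam")  # placeholder BAM
readCounts <- applyFilters(readCounts, residual = TRUE, blacklist = TRUE)
```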

@kandabarau

kandabarau commented Mar 7, 2024

Some regions will likely drop off due to lack of mappability (see the hg19_SR50_vs_hg38_PE150_qqplot_mappability.pdf posted above). Note that the control residuals are probably used as a weight in the normalization of the test dataset. This is why you will likely prefer your own control samples, or a similarly prepped dataset, to improve the fit (I'm not an expert on how this works).

thank you @mmterpstra ,

It does not ease my life, but it was important to hear an opinion from someone who has done it already.

What do you think: how important is downsampling of the control WGS? If both the control and test WGS are PCR-free, how important is it to keep their depths comparable?

Bests,
Sergey

fyi @Overnightology

@liuyang2006

Hi everyone,

I want to use QDNAseq on the T2T-CHM13 v2 genome. Can I use hg38, or do I need T2T annotations? Is there a ready-made package for the T2T annotations? Thank you!

Best,
Yang

@RodrigoGM

Hi @liuyangzzu, you'll need to use the T2T annotations because the coordinates aren't the same. I've put together T2T annotations and recently uploaded the repo. At the moment the subset of 1000 Genomes samples for the residuals estimation is still processing; however, the repo is available if you want to begin using it for development. I'll update the residuals once the 1000 Genomes data is processed.

@liuyang2006

Great! Looking forward to your updates for the T2T genome. I will check out the repo as well. Thank you!
