*Ewing et al (2020) Structural variants at the BRCA1/2 loci are a common source of homologous repair deficiency in high grade serous ovarian carcinoma.*

# Notebook 1 - Compile sample information

This notebook contains the code to bring together multiple sources of information into a couple of data frames for use in further downstream analyses.

## RNAseq quantifications

Load the libraries used to import transcript-level counts (`tximport`) and to transform those counts while adjusting for cohort and tumour cellularity (`DESeq2`).

In [11]:
require(tximport)
require(DESeq2)

Load quantification files from salmon for three HGSOC cohorts.

In [12]:
#Load all salmon quantification files for three cohorts
files_scot<-dir("../../alignments/SHGSOC/salmon",pattern="quant.sf",recursive = T,full.names = TRUE)
files_scot_additional<-dir("../../bcbio/SHGSOC/2019-2-26",pattern="quant.sf",recursive = T,full.names = TRUE)
files_aocs<-dir("../AOCS/salmon",pattern="quant.sf",recursive = T,full.names = TRUE)
files_tcga<-dir("../../bcbio/TCGA_US_OV/TCGAvirtualproj",pattern="quant.sf",recursive = T,full.names = TRUE)
files<-c(files_scot,files_scot_additional,files_aocs,files_tcga)

#Remove replicates and sample exclusions 
rna_reps_to_exclude<-read.table("RNAseq_replicates_forexclusion.txt",sep="\t")
rna_reps_to_exclude<-as.character(rna_reps_to_exclude[,1])

new_files_orig<-setdiff(files,rna_reps_to_exclude)
tx2knownGene <- read.csv("tx2gene.csv",header = F)

#Rename files with consistent sample IDs
names(new_files_orig)[1:37]<-do.call("cbind",strsplit(new_files_orig[1:37],split = "[/.]"))[10,]
names(new_files_orig)[38:42]<-do.call("cbind",strsplit(new_files_orig[38:42],split = "[/.]"))[10,]
names(new_files_orig)[43:122]<-do.call("cbind",strsplit(new_files_orig[43:122],split = "[/.]"))[6,]
names(new_files_orig)[123:152]<-do.call("cbind",strsplit(new_files_orig[123:152],split = "[/.]"))[10,]

txi.salmon <- tximport(new_files_orig, type = "salmon", tx2gene = tx2knownGene)

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 
summarizing abundance
summarizing counts
summarizing length
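
The renaming step above depends on fixed index ranges (`1:37`, `38:42`, ...) and fixed path depths, so it needs updating whenever the file list changes. A minimal sketch of an alternative, assuming each path contains a sample ID of the form `SHGSOC###`, `AOCS_###` or `DO#####` (the ID styles used later in this notebook). The example paths and the helper name `extract_sample_id` are hypothetical and not part of the original analysis.

```r
# Hypothetical example paths; the real directory layout is not shown here.
example_paths <- c(
  "../../alignments/SHGSOC/salmon/SHGSOC001T/quant.sf",
  "../AOCS/salmon/AOCS_034T/quant.sf",
  "../../bcbio/TCGA_US_OV/TCGAvirtualproj/DO28089T/quant.sf",
  "../other/unknown_sample/quant.sf"
)

# Return the first ID-like substring of each path, or NA if there is none.
extract_sample_id <- function(paths) {
  pattern <- "SHGSOC[0-9]{3}|AOCS_[0-9]{3}|DO[0-9]{5}"
  m <- regexpr(pattern, paths)
  ids <- rep(NA_character_, length(paths))
  ids[m != -1] <- regmatches(paths, m)
  ids
}

sample_ids <- extract_sample_id(example_paths)

# Name the paths by sample ID, as tximport uses these names for its columns.
named_paths <- setNames(example_paths, sample_ids)
```

Paths without a recognisable ID come back as `NA`, which makes unexpected files easy to spot before they reach `tximport`.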


Add in tumour cellularity information and filter to samples with usable WGS.

In [16]:
sample<-read.table("<pathtosampleandcellularitydf>",sep="\t",header=T,stringsAsFactors=F)
sample_expr<-sample[,c("Sample","Purity")]

rna_sampleids<-colnames(head(txi.salmon$counts))
rna_short_sampleids<-rep(NA,length(rna_sampleids))
rna_short_sampleids[1:42]<-substr(rna_sampleids[1:42],1,9)
rna_short_sampleids[43:122]<-substr(rna_sampleids[43:122],1,8)
rna_short_sampleids[123:152]<-substr(rna_sampleids[123:152],1,7)
rna<-data.frame(RNA_sample=rna_sampleids,Sample=rna_short_sampleids)
rna<-rna[as.character(rna$Sample) %in% as.character(sample_expr$Sample),]

new_files<-new_files_orig[as.character(rna$RNA_sample)]

txi.salmon2 <- tximport(new_files, type = "salmon", tx2gene = tx2knownGene)

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
summarizing abundance
summarizing counts
summarizing length


Correct counts for cohort and cellularity

In [18]:
coldata<-merge(rna,sample_expr,by="Sample",all.x=T)
coldata$Cohort<-substr(as.character(coldata$Sample),1,2)

rownames(coldata)<-as.character(coldata$RNA_sample)
coldata<-coldata[as.character(rna$RNA_sample),]
rownames(coldata)<-c(1:150)

dds_correct <- DESeqDataSetFromTximport(txi.salmon2, 
                               colData = coldata,
                                design=~as.factor(Cohort)+Purity)

“some variables in design formula are characters, converting to factors”
using counts and average transcript lengths from tximport
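
Because the row names of `coldata` are reset to `1:150`, the pairing between sample annotations and count columns rests on row order alone. A minimal sketch of an explicit alignment check, using toy stand-ins (`counts`, `coldata_toy`, with made-up values) rather than the real `txi.salmon2$counts` and `coldata`:

```r
# Toy stand-ins for the count matrix and sample table (values are made up).
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s3", "s1", "s2")))
coldata_toy <- data.frame(RNA_sample = c("s1", "s2", "s3"),
                          Purity = c(0.8, 0.6, 0.7),
                          stringsAsFactors = FALSE)

# Reorder the sample table to follow the column order of the count matrix.
coldata_toy <- coldata_toy[match(colnames(counts), coldata_toy$RNA_sample), ]
rownames(coldata_toy) <- NULL

# Fail early if any sample is missing or out of order.
stopifnot(!anyNA(coldata_toy$RNA_sample),
          identical(coldata_toy$RNA_sample, colnames(counts)))
```

The same `stopifnot` call could be run on the real objects just before building the DESeq2 data set.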


Filter to protein coding genes only and variance stabilising transform counts

In [52]:
biom_res<-read.table("All_genes_quant_type.txt",sep="\t",header=T)
rownames(biom_res)<-as.character(biom_res[,1])
protein_coding_genes<-as.character(biom_res[biom_res$Gene.type=="protein_coding",1])

dds_correct <- DESeq(dds_correct)
vsd <- vst(dds_correct , blind=FALSE)
#write.table(rownames(assay(vsd)),file="All_genes_quant.txt",sep="\t",col.names=F,row.names=F,quote=F)
protein_coding_vsd<-assay(vsd)[protein_coding_genes,]

using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


In [53]:
colnames(protein_coding_vsd)<-coldata$Sample


Get VST expression counts for HR genes

In [60]:
rnaseq_brca1<-protein_coding_vsd['ENSG00000012048',]
rnaseq_brca2<-protein_coding_vsd['ENSG00000139618',]

rnaseq_bard1<-protein_coding_vsd['ENSG00000138376',]
rnaseq_rad50<-protein_coding_vsd['ENSG00000113522',]
rnaseq_nbn<-protein_coding_vsd['ENSG00000104320',]
rnaseq_mre11<-protein_coding_vsd['ENSG00000020922',]
rnaseq_rad51b<-protein_coding_vsd['ENSG00000182185',]
rnaseq_rad51<-protein_coding_vsd['ENSG00000051180',]
rnaseq_palb2<-protein_coding_vsd['ENSG00000083093',]
rnaseq_rad51d<-protein_coding_vsd['ENSG00000185379',]
rnaseq_rad51c<-protein_coding_vsd['ENSG00000108384',]
rnaseq_brip1<-protein_coding_vsd['ENSG00000136492',]

rnaseq_brca<-rbind(rnaseq_brca1,rnaseq_brca2,rnaseq_bard1,rnaseq_rad50,rnaseq_nbn,
                        rnaseq_mre11,rnaseq_rad51b,rnaseq_rad51,rnaseq_palb2,rnaseq_rad51d,
                         rnaseq_rad51c,rnaseq_brip1)
rnaseq_brca<-t(rnaseq_brca)
Sample<-rownames(rnaseq_brca)
all_rnaseq_brca<-data.frame(Sample=Sample,rnaseq_brca)
colnames(all_rnaseq_brca)<-c("Sample","BRCA1_VST","BRCA2_VST","BARD1_VST","RAD50_VST","NBN_VST","MRE11_VST","RAD51B_VST",
                               "RAD51_VST","PALB2_VST","RAD51D_VST","RAD51C_VST","BRIP1_VST")

dim(all_rnaseq_brca)
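
The per-gene extraction above can also be written as a single subset driven by a named vector of Ensembl IDs. A minimal sketch, assuming a toy matrix `toy_vsd` with random values in place of `protein_coding_vsd`; only three of the twelve genes are shown, and the object names are illustrative.

```r
# Gene symbol -> Ensembl ID, taken from the cell above (subset for brevity).
hr_genes <- c(BRCA1 = "ENSG00000012048",
              BRCA2 = "ENSG00000139618",
              BARD1 = "ENSG00000138376")

# Toy stand-in for protein_coding_vsd: genes in rows, samples in columns.
set.seed(1)
toy_vsd <- matrix(rnorm(12), nrow = 3,
                  dimnames = list(unname(hr_genes), paste0("S", 1:4)))

# Subset all genes at once, then transpose to samples in rows.
hr_expr <- t(toy_vsd[hr_genes, , drop = FALSE])
colnames(hr_expr) <- paste0(names(hr_genes), "_VST")

hr_df <- data.frame(Sample = rownames(hr_expr), hr_expr,
                    row.names = NULL, stringsAsFactors = FALSE)
```

Keeping the symbol-to-ID mapping in one vector means the column names cannot drift out of step with the row order.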

Write VST expression counts for HR genes to file.

In [62]:
all_rnaseq_brca[,-1]<-apply(all_rnaseq_brca[,-1],2,as.numeric)

write.table(all_rnaseq_brca,file="Manuscript/Intermediate_data/RNAseq_TPMs_VST.txt",sep="\t",row.names=F,quote=F)

## Whole genome doubling 

In [63]:
facets_score<-read.table("<path to facets output>",sep="\t",header=T,stringsAsFactors=F)
colnames(facets_score)<-c("Sample","Facets_WGD_score")

## Rearrangement signatures and HRDetect prediction

In [64]:
hrdetect<-read.table("Tables/HGSOC_HRDetect_results.txt",sep="\t",header=T,stringsAsFactors=F)
hrdetect_score<-hrdetect[,c(1,4,5,14)]
colnames(hrdetect_score)<-c("Sample","ReSig_3","ReSig_5","HRDetect")

## Tumour cellularity

In [65]:
purity<-read.table("<Path_to_CLImAT_tumour_purity>",sep="\t",header=T,stringsAsFactors=F)

## BRCA1/2 Loss of heterozygosity 

In [1]:
brca1_loh<-read.table("<pathtobrca1LOH>",sep="\t",stringsAsFactors=F)
colnames(brca1_loh)<-c("BAF","Sample")
brca1_loh$Sample<-as.character(brca1_loh$Sample)
brca1_loh$BRCA1_LOH<-ifelse(brca1_loh$BAF==1,1,0)
brca1_loh<-brca1_loh[,2:3]
brca1_loh$Sample<-gsub("T","",brca1_loh$Sample)

In [2]:
brca2_loh<-read.table("<pathtobrca2LOH>",sep="\t",stringsAsFactors=F)
colnames(brca2_loh)<-c("BAF","Sample")
brca2_loh$BRCA2_LOH<-ifelse(brca2_loh$BAF==1,1,0)
brca2_loh<-brca2_loh[,2:3]

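#Strip the "T" suffix from the BRCA2 sample IDs (taken from brca2_loh itself, not brca1_loh)
brca2_loh$Sample<-gsub("T","",as.character(brca2_loh$Sample))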

Write BRCA1/2 LOH status to file

In [None]:
loh<-merge(brca1_loh,brca2_loh,by="Sample")
write.table(loh,"~/Desktop/BRCA1_BRCA2_SVs_paper/Analysis/LOH_status.txt",sep="\t",quote=F,row.names=F)
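
The two LOH cells apply the same steps to each gene, so they can be expressed as one helper. A minimal sketch with made-up BAF values; the helper name `make_loh`, the toy tables, and the assumption that sample IDs carry a trailing `T` are illustrative only. Note that `sub("T$", ...)` removes just a trailing `T`, which is slightly stricter than the `gsub("T", ...)` used above.

```r
# Toy BAF tables (made-up values) in the same "BAF", "Sample" layout.
brca1_toy <- data.frame(BAF = c(1, 0.62, 1),
                        Sample = c("AOCS_034T", "SHGSOC001T", "DO28089T"),
                        stringsAsFactors = FALSE)
brca2_toy <- data.frame(BAF = c(0.5, 1, 1),
                        Sample = c("AOCS_034T", "SHGSOC001T", "DO28089T"),
                        stringsAsFactors = FALSE)

# Code LOH as 1 when BAF equals 1, and strip a trailing "T" from the sample ID.
make_loh <- function(df, gene) {
  out <- data.frame(Sample = sub("T$", "", as.character(df$Sample)),
                    stringsAsFactors = FALSE)
  out[[paste0(gene, "_LOH")]] <- ifelse(df$BAF == 1, 1, 0)
  out
}

loh_toy <- merge(make_loh(brca1_toy, "BRCA1"),
                 make_loh(brca2_toy, "BRCA2"),
                 by = "Sample")
```

Each gene's sample IDs are derived from its own table, so the merge on `Sample` does not depend on the two input files sharing a row order.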

## Non-BRCA HR gene mutational status

Flag samples that carry deleterious germline or somatic SNVs/indels in non-BRCA HR genes, as defined by KEGG.

In [68]:
aocs_nonBRCAHRgerm<-c("AOCS_063","AOCS_065","AOCS_079","AOCS_097","AOCS_106","AOCS_108","AOCS_125","AOCS_143",
                      "AOCS_158","AOCS_163","AOCS_164","AOCS_168","AOCS_170")

shgsoc_nonBRCAHRgerm<-c("SHGSOC027","SHGSOC034","SHGSOC037","SHGSOC043","SHGSOC045","SHGSOC054","SHGSOC065","SHGSOC076","SHGSOC078",
                        "SHGSOC082","SHGSOC084","SHGSOC088","SHGSOC099","SHGSOC101","SHGSOC102")

tcga_nonBRCAHRgerm<-c("DO28004","DO28093","DO28412","DO28763","DO29146","DO30060","DO31551")


aocs_nonBRCAHRsom<-c("AOCS_111","AOCS_131")
shgsoc_nonBRCAHRsom<-c("SHGSOC015","SHGSOC031","SHGSOC037","SHGSOC042","SHGSOC044")

nonHR<-data.frame(Sample=brca2_loh$Sample, non_BRCA_HR_Germline_SNV=rep(0,210),non_BRCA_HR_Somatic_SNV=rep(0,210))
rownames(nonHR)<-as.character(nonHR$Sample)
nonHR[unique(c(aocs_nonBRCAHRgerm,shgsoc_nonBRCAHRgerm,tcga_nonBRCAHRgerm)),"non_BRCA_HR_Germline_SNV"]<-1
nonHR[unique(c(aocs_nonBRCAHRsom,shgsoc_nonBRCAHRsom)),"non_BRCA_HR_Somatic_SNV"]<-1
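
Assigning by row name, as above, appends a new row to a data frame whenever an ID is not already present. A minimal sketch of the same flagging with `%in%`, which leaves the number of rows unchanged; the short sample list and the absent ID `AOCS_999` are made up for illustration.

```r
# Toy sample list; "AOCS_999" below is deliberately absent from it.
samples <- c("AOCS_063", "AOCS_111", "SHGSOC037", "DO28004")
germline_hits <- c("AOCS_063", "SHGSOC037", "DO28004", "AOCS_999")
somatic_hits <- c("AOCS_111", "SHGSOC037")

# Membership tests give one 0/1 value per existing sample and add no rows.
flags <- data.frame(
  Sample = samples,
  non_BRCA_HR_Germline_SNV = as.integer(samples %in% germline_hits),
  non_BRCA_HR_Somatic_SNV = as.integer(samples %in% somatic_hits),
  stringsAsFactors = FALSE
)

# IDs that were listed but not found, worth reviewing by hand.
unmatched <- setdiff(c(germline_hits, somatic_hits), samples)
```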

## Mutational load

Incorporate mutational loads of SNVs/indels, large CNVs (>1Mb) and structural variants called by Manta.

In [None]:
mut_load<-read.table("Analysis/HGSOC_SNV_mutational_load.txt",sep="\t")
colnames(mut_load)<-c("Sample","Mutational_load")
rownames(mut_load)<-as.character(mut_load[,1])

SV_load<-read.table("Analysis/HGSOC_SV_mutational_load.txt",sep="\t")
colnames(SV_load)<-c("Sample","SV_load")
rownames(SV_load)<-as.character(SV_load[,1])

CNV_load<-read.table("Analysis/LargeCNV_load_cnvkit-climat_filt_overlap.txt")
CNV_load<-CNV_load[,c(2,1)]
colnames(CNV_load)<-c("Sample","CNV_load")
rownames(CNV_load)<-as.character(CNV_load[,1])
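
The three load tables share a `Sample` column, so they can be combined in one pass. A minimal sketch with made-up numbers, assuming a full outer join is wanted; how the tables are actually combined downstream is not shown in this section.

```r
# Toy load tables (made-up values) with the column names used above.
mut <- data.frame(Sample = c("S1", "S2", "S3"), Mutational_load = c(100, 250, 75))
sv  <- data.frame(Sample = c("S2", "S1", "S3"), SV_load = c(40, 12, 33))
cnv <- data.frame(Sample = c("S3", "S1"), CNV_load = c(5, 9))

# Successively merge on Sample, keeping samples missing from any one table.
loads <- Reduce(function(x, y) merge(x, y, by = "Sample", all = TRUE),
                list(mut, sv, cnv))
```

Samples absent from a table get `NA` for that load rather than being dropped.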

## BRCA1/2 promoter methylation 

In [None]:
brca1_pro_meth_aocs<-read.table("../methylation/AOCS/AOCS_samples_BRCA1_promoter_methylation.txt",sep="\t")
brca1_pro_meth_aocs<-as.character(brca1_pro_meth_aocs[,1])

brca1_pro_meth_tcga<-read.table("../methylation/TCGA/TCGA_samples_BRCA1_promoter_methylation.txt",sep="\t")
brca1_pro_meth_tcga<-as.character(brca1_pro_meth_tcga[,1])
brca1_pro_meth_samples<-c(brca1_pro_meth_aocs,brca1_pro_meth_tcga)

brca1_pro_meth<-data.frame(Sample=mut_load$Sample)
rownames(brca1_pro_meth)<-as.character(brca1_pro_meth$Sample)
brca1_pro_meth$BRCA1_pro_meth<-0
brca1_pro_meth[brca1_pro_meth_samples,"BRCA1_pro_meth"]<-1
table(brca1_pro_meth$BRCA1_pro_meth)

## BRCA1 mutational status

*SNVs*

In [74]:
aocsBRCA1germ<-c("AOCS_034","AOCS_057","AOCS_058","AOCS_065","AOCS_088","AOCS_095","AOCS_105","AOCS_108","AOCS_131","AOCS_139","AOCS_143","AOCS_145","AOCS_146")
shgsocBRCA1germ<-c("SHGSOC001","SHGSOC007","SHGSOC022","SHGSOC056","SHGSOC060","SHGSOC094")
tcgaBRCA1germ<-c("DO28089","DO30220","DO30340","DO32420")

aocsBRCA1som<-c("AOCS_079","AOCS_086","AOCS_130","AOCS_152","AOCS_171")
shgsocBRCA1som<-c("SHGSOC011","SHGSOC031","SHGSOC072")
tcgaBRCA1som<-c("DO28521","DO32391")

*SVs*

In [None]:
highconf<-read.csv("<pathtohighconfSVs>",stringsAsFactors=F)
co<-substr(highconf$Sample,1,2)
highconf$Cohort<-sapply(co,function(x) switch(x,"SH"="SHGSOC","DO"="TCGA","AO"="AOCS"))
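
With `sapply` and `switch`, an unmatched prefix makes `switch` return `NULL`, so the result can become a list instead of a character vector. A minimal sketch of the same prefix-to-cohort mapping as a named lookup vector, which yields `NA` for unknown prefixes; the sample IDs here, including `XX123`, are illustrative.

```r
# Two-letter prefix -> cohort, matching the mapping in the cell above.
cohort_lookup <- c(SH = "SHGSOC", DO = "TCGA", AO = "AOCS")

# Illustrative IDs; "XX123" has no known prefix.
toy_samples <- c("SHGSOC001", "DO28089", "AOCS_034", "XX123")

# Vectorised lookup; unknown prefixes become NA.
toy_cohort <- unname(cohort_lookup[substr(toy_samples, 1, 2)])
```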

Classifying high-confidence AOCS SVs.

In [None]:
aocs_brca_svs<-highconf[highconf$Cohort=="AOCS",]
    
aocs_brca_svs$BRCA_SV_category<-NA
aocs_brca_svs[aocs_brca_svs$BRCA_mutation_category=="Deletion","BRCA_SV_category"]<-"Deletion overlapping exon (LOF)"
aocs_brca_svs[aocs_brca_svs$BRCA_mutation_category=="Duplication","BRCA_SV_category"]<-"Duplication spanning gene (COPY_GAIN)"
aocs_brca_svs[aocs_brca_svs$BRCA_mutation_category=="Inversion","BRCA_SV_category"]<-"Inversion spanning gene (INV_SPAN)"
aocs_brca_svs[aocs_brca_svs$BRCA_mutation_category=="Complex_incl_del" ,"BRCA_SV_category"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
aocs_brca_svs[aocs_brca_svs$BRCA_mutation_category=="Complex_no_del","BRCA_SV_category"]<-"Complex combination of SVs without LOF"

Classifying high-confidence SHGSOC SVs

In [None]:
shgsoc_brca_svs<-highconf[highconf$Cohort=="SHGSOC",]

shgsoc_brca_svs$BRCA_SV_category<-NA
shgsoc_brca_svs[shgsoc_brca_svs$BRCA_mutation_category=="Deletion","BRCA_SV_category"]<-"Deletion overlapping exon (LOF)"
shgsoc_brca_svs[shgsoc_brca_svs$BRCA_mutation_category=="Duplication","BRCA_SV_category"]<-"Duplication spanning gene (COPY_GAIN)"
shgsoc_brca_svs[shgsoc_brca_svs$BRCA_mutation_category=="Inversion","BRCA_SV_category"]<-"Inversion spanning gene (INV_SPAN)"
shgsoc_brca_svs[shgsoc_brca_svs$BRCA_mutation_category=="NoLOF" ,"BRCA_SV_category"]<-"SV without LOF"
shgsoc_brca_svs[shgsoc_brca_svs$BRCA_mutation_category=="Complex_no_del","BRCA_SV_category"]<-"Complex combination of SVs without LOF"


Classifying high-confidence TCGA SVs

In [None]:
tcga_brca_svs<-highconf[highconf$Cohort=="TCGA",]

tcga_brca_svs$BRCA_SV_category<-NA
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Deletion","BRCA_SV_category"]<-"Deletion overlapping exon (LOF)"
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Duplication","BRCA_SV_category"]<-"Duplication spanning gene (COPY_GAIN)"
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Inversion","BRCA_SV_category"]<-"Inversion spanning gene (INV_SPAN)"
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Intragenic_duplication" ,"BRCA_SV_category"]<-"Intragenic exonic duplication"
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Complex_incl_del","BRCA_SV_category"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
tcga_brca_svs[tcga_brca_svs$BRCA_mutation_category=="Complex_no_del","BRCA_SV_category"]<-"Complex combination of SVs without LOF"
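
The three classification cells apply largely the same category-to-label mapping per cohort. A minimal sketch of one shared lookup table, assuming it is acceptable to apply the union of all labels to every cohort (the cells above apply a cohort-specific subset). The toy data frame and the `Unlisted` category are made up.

```r
# BRCA_mutation_category -> BRCA_SV_category, using the labels from the cells above.
sv_category_labels <- c(
  Deletion = "Deletion overlapping exon (LOF)",
  Duplication = "Duplication spanning gene (COPY_GAIN)",
  Inversion = "Inversion spanning gene (INV_SPAN)",
  Intragenic_duplication = "Intragenic exonic duplication",
  NoLOF = "SV without LOF",
  Complex_incl_del = "Complex combination of SV intervals including 1+ LOF (CPX: LOF)",
  Complex_no_del = "Complex combination of SVs without LOF"
)

# Toy SV table; "Unlisted" stands for any category without a label.
toy_svs <- data.frame(
  Sample = c("AOCS_034", "SHGSOC001", "DO28089", "DO28521"),
  BRCA_mutation_category = c("Deletion", "NoLOF", "Complex_incl_del", "Unlisted"),
  stringsAsFactors = FALSE
)

# Unlabelled categories stay NA, as with the explicit assignments.
toy_svs$BRCA_SV_category <- unname(sv_category_labels[toy_svs$BRCA_mutation_category])
```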


Creating BRCA1/2 mutational status binary indicators.

In [80]:
aocs_brca1<-data.frame(BRCA1_Germline_SNV=rep(0,80),BRCA1_Somatic_SNV=rep(0,80),BRCA1_LOF=rep(0,80),BRCA1_COPY_GAIN=rep(0,80),BRCA1_INV_SPAN=rep(0,80),BRCA1_CPX_LOF=rep(0,80),BRCA1_CPX_noLOF=rep(0,80),BRCA1_noLOF=rep(0,80),BRCA1_intgendup=rep(0,80))
rownames(aocs_brca1)<-as.character(hrdetect_score[grep("AOCS",hrdetect_score$Sample),"Sample"])

aocs_brca1[aocsBRCA1germ,"BRCA1_Germline_SNV"]<-1
aocs_brca1[aocsBRCA1som,"BRCA1_Somatic_SNV"]<-1
brca1_lof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,8)
aocs_brca1[brca1_lof_samples,"BRCA1_LOF"]<-1
brca1_copygain_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,8)
aocs_brca1[brca1_copygain_samples,"BRCA1_COPY_GAIN"]<-1
brca1_invspan_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,8)
aocs_brca1[brca1_invspan_samples,"BRCA1_INV_SPAN"]<-1
brca1_cpxlof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,8)
aocs_brca1[brca1_cpxlof_samples,"BRCA1_CPX_LOF"]<-1
brca1_cpxnolof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,8)
aocs_brca1[brca1_cpxnolof_samples,"BRCA1_CPX_noLOF"]<-1
brca1_nolof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,8)
aocs_brca1[brca1_nolof_samples,"BRCA1_noLOF"]<-1
brca1_intragenicdup_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA1" & aocs_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,8)
aocs_brca1[brca1_intragenicdup_samples,"BRCA1_intgendup"]<-1

In [81]:
shgsoc_brca1<-data.frame(BRCA1_Germline_SNV=rep(0,85),BRCA1_Somatic_SNV=rep(0,85),BRCA1_LOF=rep(0,85),BRCA1_COPY_GAIN=rep(0,85),BRCA1_INV_SPAN=rep(0,85),BRCA1_CPX_LOF=rep(0,85),BRCA1_CPX_noLOF=rep(0,85),BRCA1_noLOF=rep(0,85),BRCA1_intgendup=rep(0,85))
rownames(shgsoc_brca1)<-as.character(hrdetect_score[grep("SHGSOC",hrdetect_score$Sample),"Sample"])

shgsoc_brca1[shgsocBRCA1germ,"BRCA1_Germline_SNV"]<-1
shgsoc_brca1[shgsocBRCA1som,"BRCA1_Somatic_SNV"]<-1
brca1_lof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,9)
shgsoc_brca1[brca1_lof_samples,"BRCA1_LOF"]<-1
brca1_copygain_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,9)
shgsoc_brca1[brca1_copygain_samples,"BRCA1_COPY_GAIN"]<-1
brca1_invspan_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,9)
shgsoc_brca1[brca1_invspan_samples,"BRCA1_INV_SPAN"]<-1
brca1_cpxlof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,9)
shgsoc_brca1[brca1_cpxlof_samples,"BRCA1_CPX_LOF"]<-1
brca1_cpxnolof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,9)
shgsoc_brca1[brca1_cpxnolof_samples,"BRCA1_CPX_noLOF"]<-1
brca1_nolof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,9)
shgsoc_brca1[brca1_nolof_samples,"BRCA1_noLOF"]<-1
brca1_intragenicdup_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA1" & shgsoc_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,9)
shgsoc_brca1[brca1_intragenicdup_samples,"BRCA1_intgendup"]<-1


In [82]:
tcga_brca1<-data.frame(BRCA1_Germline_SNV=rep(0,45),BRCA1_Somatic_SNV=rep(0,45),BRCA1_LOF=rep(0,45),BRCA1_COPY_GAIN=rep(0,45),BRCA1_INV_SPAN=rep(0,45),BRCA1_CPX_LOF=rep(0,45),BRCA1_CPX_noLOF=rep(0,45),BRCA1_noLOF=rep(0,45),BRCA1_intgendup=rep(0,45))

rownames(tcga_brca1)<-as.character(hrdetect_score[grep("DO",hrdetect_score$Sample),"Sample"])

tcga_brca1[tcgaBRCA1germ,"BRCA1_Germline_SNV"]<-1
tcga_brca1[tcgaBRCA1som,"BRCA1_Somatic_SNV"]<-1

BRCA1_lof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,7)
tcga_brca1[BRCA1_lof_samples,"BRCA1_LOF"]<-1
BRCA1_copygain_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,7)
tcga_brca1[BRCA1_copygain_samples,"BRCA1_COPY_GAIN"]<-1
BRCA1_invspan_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,7)
tcga_brca1[BRCA1_invspan_samples,"BRCA1_INV_SPAN"]<-1
BRCA1_cpxlof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,7)
tcga_brca1[BRCA1_cpxlof_samples,"BRCA1_CPX_LOF"]<-1
BRCA1_cpxnolof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,7)
tcga_brca1[BRCA1_cpxnolof_samples,"BRCA1_CPX_noLOF"]<-1
BRCA1_nolof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,7)
tcga_brca1[BRCA1_nolof_samples,"BRCA1_noLOF"]<-1
BRCA1_intragenicdup_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA1" & tcga_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,7)
tcga_brca1[BRCA1_intragenicdup_samples,"BRCA1_intgendup"]<-1
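
The indicator cells repeat one pattern: select SVs for a gene and category, truncate the sample ID to the cohort's ID width, and set a column to 1. A minimal sketch of that pattern as a helper, with made-up toy data; the function name `flag_samples` and its arguments are hypothetical, and the `T` suffix on the toy IDs is an assumption.

```r
# Set `column` to 1 for samples with an SV of the given gene and category.
flag_samples <- function(status, svs, gene, category, column, id_width) {
  # which() drops NA comparisons, e.g. SVs with no assigned category.
  hit <- which(svs$Gene == gene & svs$BRCA_SV_category == category)
  ids <- unique(substr(as.character(svs$Sample[hit]), 1, id_width))
  status[rownames(status) %in% ids, column] <- 1
  status
}

# Toy inputs (made-up): one status row per sample, one SV per row.
toy_status <- data.frame(BRCA1_LOF = rep(0, 3), BRCA1_COPY_GAIN = rep(0, 3),
                         row.names = c("AOCS_034", "AOCS_057", "AOCS_058"))
toy_sv_calls <- data.frame(
  Sample = c("AOCS_034T", "AOCS_034T", "AOCS_058T", "AOCS_057T"),
  Gene = c("BRCA1", "BRCA1", "BRCA1", "BRCA2"),
  BRCA_SV_category = c("Deletion overlapping exon (LOF)",
                       "Deletion overlapping exon (LOF)",
                       NA,
                       "Deletion overlapping exon (LOF)"),
  stringsAsFactors = FALSE
)

toy_status <- flag_samples(toy_status, toy_sv_calls, gene = "BRCA1",
                           category = "Deletion overlapping exon (LOF)",
                           column = "BRCA1_LOF", id_width = 8)
```

Passing the ID width as an argument (8 for AOCS, 9 for SHGSOC, 7 for TCGA in the cells above) keeps it in one place per cohort.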


## BRCA2 mutational status

*SNVs*

In [83]:
aocsBRCA2germ<-c("AOCS_104","AOCS_153")
shgsocBRCA2germ<-c("SHGSOC005","SHGSOC009","SHGSOC043","SHGSOC051","SHGSOC059","SHGSOC100")
tcgaBRCA2germ<-c("DO29980","DO30650","DO30970","DO32237")

aocsBRCA2som<-c("AOCS_063","AOCS_122","AOCS_147","AOCS_149")
shgsocBRCA2som<-"SHGSOC090"
tcgaBRCA2som<-c("DO28119","DO28273","DO31869")


*SVs*

In [84]:
aocs_brca2<-data.frame(BRCA2_Germline_SNV=rep(0,80),BRCA2_Somatic_SNV=rep(0,80),BRCA2_LOF=rep(0,80),BRCA2_COPY_GAIN=rep(0,80),BRCA2_INV_SPAN=rep(0,80),BRCA2_CPX_LOF=rep(0,80),BRCA2_CPX_noLOF=rep(0,80),BRCA2_noLOF=rep(0,80),BRCA2_intgendup=rep(0,80))
rownames(aocs_brca2)<-as.character(hrdetect_score[grep("AOCS",hrdetect_score$Sample),"Sample"])

aocs_brca2[aocsBRCA2germ,"BRCA2_Germline_SNV"]<-1
aocs_brca2[aocsBRCA2som,"BRCA2_Somatic_SNV"]<-1
brca2_lof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,8)
aocs_brca2[brca2_lof_samples,"BRCA2_LOF"]<-1
brca2_copygain_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,8)
aocs_brca2[brca2_copygain_samples,"BRCA2_COPY_GAIN"]<-1
brca2_invspan_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,8)
aocs_brca2[brca2_invspan_samples,"BRCA2_INV_SPAN"]<-1
brca2_cpxlof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,8)
aocs_brca2[brca2_cpxlof_samples,"BRCA2_CPX_LOF"]<-1
brca2_cpxnolof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,8)
aocs_brca2[brca2_cpxnolof_samples,"BRCA2_CPX_noLOF"]<-1
brca2_nolof_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,8)
aocs_brca2[brca2_nolof_samples,"BRCA2_noLOF"]<-1
brca2_intragenicdup_samples<-substr(unique(as.character(aocs_brca_svs[aocs_brca_svs$Gene =="BRCA2" & aocs_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,8)
aocs_brca2[brca2_intragenicdup_samples,"BRCA2_intgendup"]<-1

In [85]:
shgsoc_brca2<-data.frame(BRCA2_Germline_SNV=rep(0,85),BRCA2_Somatic_SNV=rep(0,85),BRCA2_LOF=rep(0,85),BRCA2_COPY_GAIN=rep(0,85),BRCA2_INV_SPAN=rep(0,85),BRCA2_CPX_LOF=rep(0,85),BRCA2_CPX_noLOF=rep(0,85),BRCA2_noLOF=rep(0,85),BRCA2_intgendup=rep(0,85))
rownames(shgsoc_brca2)<-as.character(hrdetect_score[grep("SHGSOC",hrdetect_score$Sample),"Sample"])

shgsoc_brca2[shgsocBRCA2germ,"BRCA2_Germline_SNV"]<-1
shgsoc_brca2[shgsocBRCA2som,"BRCA2_Somatic_SNV"]<-1
brca2_lof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,9)
shgsoc_brca2[brca2_lof_samples,"BRCA2_LOF"]<-1
brca2_copygain_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,9)
shgsoc_brca2[brca2_copygain_samples,"BRCA2_COPY_GAIN"]<-1
brca2_invspan_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,9)
shgsoc_brca2[brca2_invspan_samples,"BRCA2_INV_SPAN"]<-1
brca2_cpxlof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,9)
shgsoc_brca2[brca2_cpxlof_samples,"BRCA2_CPX_LOF"]<-1
brca2_cpxnolof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,9)
shgsoc_brca2[brca2_cpxnolof_samples,"BRCA2_CPX_noLOF"]<-1
brca2_nolof_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,9)
shgsoc_brca2[brca2_nolof_samples,"BRCA2_noLOF"]<-1
brca2_intragenicdup_samples<-substr(unique(as.character(shgsoc_brca_svs[shgsoc_brca_svs$Gene =="BRCA2" & shgsoc_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,9)
shgsoc_brca2[brca2_intragenicdup_samples,"BRCA2_intgendup"]<-1


In [86]:
tcga_brca2<-data.frame(BRCA2_Germline_SNV=rep(0,45),BRCA2_Somatic_SNV=rep(0,45),BRCA2_LOF=rep(0,45),BRCA2_COPY_GAIN=rep(0,45),BRCA2_INV_SPAN=rep(0,45),BRCA2_CPX_LOF=rep(0,45),BRCA2_CPX_noLOF=rep(0,45),BRCA2_noLOF=rep(0,45),BRCA2_intgendup=rep(0,45))
rownames(tcga_brca2)<-as.character(hrdetect_score[grep("DO",hrdetect_score$Sample),"Sample"])

tcga_brca2[tcgaBRCA2germ,"BRCA2_Germline_SNV"]<-1
tcga_brca2[tcgaBRCA2som,"BRCA2_Somatic_SNV"]<-1
BRCA2_lof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Deletion overlapping exon (LOF)","Sample"])),1,7)
tcga_brca2[BRCA2_lof_samples,"BRCA2_LOF"]<-1
BRCA2_copygain_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Duplication spanning gene (COPY_GAIN)","Sample"])),1,7)
tcga_brca2[BRCA2_copygain_samples,"BRCA2_COPY_GAIN"]<-1
BRCA2_invspan_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Inversion spanning gene (INV_SPAN)","Sample"])),1,7)
tcga_brca2[BRCA2_invspan_samples,"BRCA2_INV_SPAN"]<-1
BRCA2_cpxlof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Complex combination of SV intervals including 1+ LOF (CPX: LOF)","Sample"])),1,7)
tcga_brca2[BRCA2_cpxlof_samples,"BRCA2_CPX_LOF"]<-1
BRCA2_cpxnolof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Complex combination of SVs without LOF","Sample"])),1,7)
tcga_brca2[BRCA2_cpxnolof_samples,"BRCA2_CPX_noLOF"]<-1
BRCA2_nolof_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="SV without LOF","Sample"])),1,7)
tcga_brca2[BRCA2_nolof_samples,"BRCA2_noLOF"]<-1
BRCA2_intragenicdup_samples<-substr(unique(as.character(tcga_brca_svs[tcga_brca_svs$Gene =="BRCA2" & tcga_brca_svs$BRCA_SV_category=="Intragenic exonic duplication","Sample"])),1,7)
tcga_brca2[BRCA2_intragenicdup_samples,"BRCA2_intgendup"]<-1

In [87]:
brca1<-rbind(aocs_brca1,shgsoc_brca1,tcga_brca1)
brca2<-rbind(aocs_brca2,shgsoc_brca2,tcga_brca2)

brca<-merge(brca1,brca2,by=0)
colnames(brca)[1]<-"Sample"

In [89]:
write.table(brca,file="Manuscript/Intermediate_data/BRCAstatus.txt",sep="\t",quote=F,row.names=F)

## Format BRCA status mutually exclusive categories

Statuses with SNVs/indels included.

In [90]:
brca$BRCAstatus<-"None"
brca$BRCA1status<-"None"
brca$BRCA2status<-"None"

brca[brca$BRCA1_LOF==1,"BRCA1status"]<-"Deletion overlapping exon (LOF)"
brca[brca$BRCA1_COPY_GAIN==1,"BRCA1status"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA1_INV_SPAN==1,"BRCA1status"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA1_CPX_LOF==1,"BRCA1status"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA1_CPX_noLOF==1,"BRCA1status"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA1_noLOF==1,"BRCA1status"]<-"SV without LOF"
brca[brca$BRCA1_intgendup==1,"BRCA1status"]<-"Intragenic exonic duplication"
brca[brca$BRCA1_Somatic_SNV==1,"BRCA1status"]<-"Somatic SNV"
brca[brca$BRCA1_Germline_SNV==1,"BRCA1status"]<-"Germline SNV"

brca[brca$BRCA2_LOF==1,"BRCA2status"]<-"Deletion overlapping exon (LOF)"
brca[brca$BRCA2_COPY_GAIN==1,"BRCA2status"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA2_INV_SPAN==1,"BRCA2status"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA2_CPX_LOF==1,"BRCA2status"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA2_CPX_noLOF==1,"BRCA2status"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA2_noLOF==1,"BRCA2status"]<-"SV without LOF"
brca[brca$BRCA2_intgendup==1,"BRCA2status"]<-"Intragenic exonic duplication"
brca[brca$BRCA2_Somatic_SNV==1,"BRCA2status"]<-"Somatic SNV"
brca[brca$BRCA2_Germline_SNV==1,"BRCA2status"]<-"Germline SNV"

brca[brca$BRCA1_intgendup==1 | brca$BRCA2_intgendup==1,"BRCAstatus"]<-"Intragenic exonic duplication"
brca[brca$BRCA1_noLOF==1 | brca$BRCA2_noLOF==1,"BRCAstatus"]<-"SV without LOF"
brca[brca$BRCA1_CPX_noLOF==1 | brca$BRCA2_CPX_noLOF==1,"BRCAstatus"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA1_CPX_LOF==1 | brca$BRCA2_CPX_LOF==1,"BRCAstatus"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA1_COPY_GAIN==1|brca$BRCA2_COPY_GAIN==1,"BRCAstatus"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA1_INV_SPAN==1 | brca$BRCA2_INV_SPAN==1,"BRCAstatus"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA1_LOF==1 | brca$BRCA2_LOF==1,"BRCAstatus"]<-"Deletion overlapping exon (LOF)"
brca[brca$BRCA1_Somatic_SNV==1 | brca$BRCA2_Somatic_SNV==1,"BRCAstatus"]<-"Somatic SNV"
brca[brca$BRCA1_Germline_SNV==1 | brca$BRCA2_Germline_SNV==1,"BRCAstatus"]<-"Germline SNV"
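
Each assignment above overwrites the previous one, so the last matching rule wins: for the per-gene columns, SNVs take precedence over SVs, and germline over somatic. A minimal sketch that makes this ordering explicit with a priority vector, listed from lowest to highest priority; the toy data frame and the helper `assign_status` are illustrative, and only three of the indicator columns are shown.

```r
# Toy indicator columns (made-up values), one row per sample.
toy_brca <- data.frame(BRCA1_LOF = c(1, 1, 0, 0),
                       BRCA1_Somatic_SNV = c(0, 1, 0, 0),
                       BRCA1_Germline_SNV = c(1, 0, 0, 1))

# Indicator column -> label, ordered from lowest to highest priority.
priority <- c(BRCA1_LOF = "Deletion overlapping exon (LOF)",
              BRCA1_Somatic_SNV = "Somatic SNV",
              BRCA1_Germline_SNV = "Germline SNV")

# Apply labels in order so that later (higher-priority) labels overwrite earlier ones.
assign_status <- function(df, priority, default = "None") {
  status <- rep(default, nrow(df))
  for (col in names(priority)) {
    status[df[[col]] == 1] <- priority[[col]]
  }
  status
}

toy_brca$BRCA1status <- assign_status(toy_brca, priority)
```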

Statuses irrespective of SNVs/indels.

In [92]:
brca$BRCAstatus_SV<-"SV absent"
brca$BRCA1status_SV<-"SV absent"
brca$BRCA2status_SV<-"SV absent"

brca[brca$BRCA1_LOF==1,"BRCA1status_SV"]<-"Deletion overlapping exon (LOF)"
brca[brca$BRCA1_COPY_GAIN==1,"BRCA1status_SV"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA1_INV_SPAN==1,"BRCA1status_SV"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA1_CPX_LOF==1,"BRCA1status_SV"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA1_CPX_noLOF==1,"BRCA1status_SV"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA1_noLOF==1,"BRCA1status_SV"]<-"SV without LOF"
brca[brca$BRCA1_intgendup==1,"BRCA1status_SV"]<-"Intragenic exonic duplication"


brca[brca$BRCA2_LOF==1,"BRCA2status_SV"]<-"Deletion overlapping exon (LOF)"
brca[brca$BRCA2_COPY_GAIN==1,"BRCA2status_SV"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA2_INV_SPAN==1,"BRCA2status_SV"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA2_CPX_LOF==1,"BRCA2status_SV"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA2_CPX_noLOF==1,"BRCA2status_SV"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA2_noLOF==1,"BRCA2status_SV"]<-"SV without LOF"
brca[brca$BRCA2_intgendup==1,"BRCA2status_SV"]<-"Intragenic exonic duplication"

brca[brca$BRCA1_intgendup==1 | brca$BRCA2_intgendup==1,"BRCAstatus_SV"]<-"Intragenic exonic duplication"
brca[brca$BRCA1_noLOF==1 | brca$BRCA2_noLOF==1,"BRCAstatus_SV"]<-"SV without LOF"
brca[brca$BRCA1_CPX_noLOF==1 | brca$BRCA2_CPX_noLOF==1,"BRCAstatus_SV"]<-"Complex combination of SVs without LOF"
brca[brca$BRCA1_CPX_LOF==1 | brca$BRCA2_CPX_LOF==1,"BRCAstatus_SV"]<-"Complex combination of SV intervals including 1+ LOF (CPX: LOF)"
brca[brca$BRCA1_COPY_GAIN==1|brca$BRCA2_COPY_GAIN==1,"BRCAstatus_SV"]<-"Duplication spanning gene (COPY_GAIN)"
brca[brca$BRCA1_INV_SPAN==1 | brca$BRCA2_INV_SPAN==1,"BRCAstatus_SV"]<-"Inversion spanning gene (INV_SPAN)"
brca[brca$BRCA1_LOF==1 | brca$BRCA2_LOF==1,"BRCAstatus_SV"]<-"Single deletion"
brca[brca$BRCA1_LOF==1 & brca$BRCA2_LOF==1,"BRCAstatus_SV"]<-"Double deletion"


Compound mutational statuses.

In [None]:
brca$BRCAstatus_compound<-"Excluded"
brca$BRCA1status_compound<-"Excluded"
brca$BRCA2status_compound<-"Excluded"

brca[brca$BRCAstatus=="None","BRCAstatus_compound"]<-"None"
brca[brca$BRCA1status=="None","BRCA1status_compound"]<-"None"
brca[brca$BRCA2status=="None","BRCA2status_compound"]<-"None"

brca[((brca$BRCA1status=="Germline SNV" | brca$BRCA1status=="Somatic SNV") & 
      brca$BRCA1status_SV=="Deletion overlapping exon (LOF)" &
      brca$BRCA2status_SV!="Deletion overlapping exon (LOF)"),
"BRCA1status_compound"]<-"SNV + deletion (same gene)"

brca[((brca$BRCA2status=="Germline SNV" | brca$BRCA2status=="Somatic SNV") & 
      brca$BRCA2status_SV=="Deletion overlapping exon (LOF)" &
     brca$BRCA1status_SV!="Deletion overlapping exon (LOF)"),
"BRCA2status_compound"]<-"SNV + deletion (same gene)"

brca[brca$BRCA1status_compound=="SNV + deletion (same gene)" | brca$BRCA2status_compound=="SNV + deletion (same gene)",
     "BRCAstatus_compound"]<-"SNV + deletion (same gene)"

#SNV + deletion (other gene)
brca[((brca$BRCA1status=="Germline SNV" | brca$BRCA1status=="Somatic SNV") & 
      brca$BRCA1status_SV!="Deletion overlapping exon (LOF)" &
      brca$BRCA2status_SV=="Deletion overlapping exon (LOF)"),
"BRCA1status_compound"]<-"SNV + deletion (other gene)"

brca[((brca$BRCA2status=="Germline SNV" | brca$BRCA2status=="Somatic SNV") & 
      brca$BRCA2status_SV!="Deletion overlapping exon (LOF)" &
     brca$BRCA1status_SV=="Deletion overlapping exon (LOF)"),
"BRCA2status_compound"]<-"SNV + deletion (other gene)"

brca[brca$BRCA1status_compound=="SNV + deletion (other gene)" | brca$BRCA2status_compound=="SNV + deletion (other gene)",
     "BRCAstatus_compound"]<-"SNV + deletion (other gene)"

#SNV + deletions (both genes)
brca[((brca$BRCA1status=="Germline SNV" | brca$BRCA1status=="Somatic SNV") & 
      brca$BRCA1status_SV=="Deletion overlapping exon (LOF)" &
      brca$BRCA2status_SV=="Deletion overlapping exon (LOF)"),
"BRCA1status_compound"]<-"SNV + deletions (both genes)"

brca[((brca$BRCA2status=="Germline SNV" | brca$BRCA2status=="Somatic SNV") & 
      brca$BRCA2status_SV=="Deletion overlapping exon (LOF)" &
     brca$BRCA1status_SV=="Deletion overlapping exon (LOF)"),
"BRCA2status_compound"]<-"SNV + deletions (both genes)"

brca[brca$BRCA1status_compound=="SNV + deletions (both genes)" | brca$BRCA2status_compound=="SNV + deletions (both genes)",
     "BRCAstatus_compound"]<-"SNV + deletions (both genes)"

brca[(brca$BRCA1_LOF==1 & brca$BRCA2_LOF==1 ),"Double_del"]<-"Double deletion"
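
The three per-gene rules above (deletion in the same gene, in the other gene, or in both) can be condensed into one vectorised function. The sketch below is illustrative only: `compound_label` is a hypothetical helper, not part of the notebook, and it uses `%in%` so that a missing SV status counts as "no deletion", which may differ from how the original comparisons treat `NA`.

```r
# Hypothetical helper (not in the original notebook): compound label for one gene
compound_label <- function(snv, sv_same, sv_other, current) {
  del <- "Deletion overlapping exon (LOF)"
  has_snv <- snv %in% c("Germline SNV", "Somatic SNV")
  same  <- sv_same %in% del    # assumption: NA is treated as "no deletion"
  other <- sv_other %in% del
  out <- current
  out[has_snv & same & !other] <- "SNV + deletion (same gene)"
  out[has_snv & !same & other] <- "SNV + deletion (other gene)"
  out[has_snv & same & other]  <- "SNV + deletions (both genes)"
  out
}

# Toy input with made-up values
del <- "Deletion overlapping exon (LOF)"
toy <- data.frame(
  BRCA1status = c("Germline SNV", "Somatic SNV", "None", "Germline SNV"),
  BRCA1status_SV = c(del, NA, del, del),
  BRCA2status_SV = c(NA, del, NA, del),
  BRCA1status_compound = c("Excluded", "Excluded", "None", "Excluded"),
  stringsAsFactors = FALSE)

toy$BRCA1status_compound <- compound_label(toy$BRCA1status,
                                           toy$BRCA1status_SV,
                                           toy$BRCA2status_SV,
                                           toy$BRCA1status_compound)
print(toy$BRCA1status_compound)
```

For BRCA2 the same helper would be called with the two SV status columns swapped.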


## Merge datasets

In [104]:
brca$Sample<-as.character(brca$Sample)
dat1<-merge(brca[,c("Sample","BRCAstatus","BRCA1status","BRCA2status",
                    "BRCAstatus_SV","BRCA1status_SV","BRCA2status_SV",
                    "BRCA1status_compound","BRCA2status_compound","BRCAstatus_compound","Double_del")],
            all_rnaseq_brca,by="Sample",all.x=TRUE)
dat2<-merge(dat1,facets_score,by="Sample")
dat3<-merge(dat2,purity,by="Sample")
dat4<-merge(dat3,brca1_loh,by="Sample")
dat5<-merge(dat4,brca2_loh,by="Sample")
dat6<-merge(dat5,nonHR,by="Sample")
dat7<-merge(dat6,mut_load,by="Sample")
dat8<-merge(dat7,SV_load,by="Sample")
dat9<-merge(dat8,CNV_load,by="Sample")
dat10<-merge(dat9,brca1_pro_meth,by="Sample")
SampleInfo<-merge(dat10,hrdetect_score,by="Sample")
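
Apart from the first merge (which uses `all.x`), every step above is an inner join on `Sample`, so any sample missing from one table is dropped from `SampleInfo`. As a rough sketch, the same kind of chain can be written with `Reduce` and checked for dropped samples; the toy tables and their values below are made up for illustration.

```r
# Toy tables with hypothetical samples and values
calls   <- data.frame(Sample = c("S1", "S2", "S3"),
                      BRCAstatus = c("None", "Germline SNV", "None"),
                      stringsAsFactors = FALSE)
purity  <- data.frame(Sample = c("S1", "S2"),
                      Purity = c(0.6, 0.8),
                      stringsAsFactors = FALSE)
sv_load <- data.frame(Sample = c("S2", "S1", "S4"),
                      SV_load = c(120, 45, 80),
                      stringsAsFactors = FALSE)

# Chain of inner joins on Sample, equivalent to repeated merge(x, y, by = "Sample")
merged <- Reduce(function(x, y) merge(x, y, by = "Sample"),
                 list(calls, purity, sv_load))

# Samples present in the first table but lost along the way
dropped <- setdiff(calls$Sample, merged$Sample)
print(merged)
print(dropped)
```

Listing the dropped samples makes it easier to confirm that any reduction in row count is expected.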


## Late sample exclusions

Exclude one uterine carcinoma, three carcinosarcomas and one sample with contamination in the paired normal.

In [105]:
SampleInfo<-SampleInfo[!(SampleInfo$Sample %in% c("SHGSOC060","SHGSOC064","SHGSOC027","SHGSOC005","DO30650")),]
dim(SampleInfo)

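Add derived variables: a binary WGD flag (Facets WGD score of at least 0.5) and a cohort label taken from the first two characters of the sample ID.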

In [106]:
SampleInfo$WGD<-0
SampleInfo[SampleInfo$Facets_WGD_score>=0.5,"WGD"]<-1

In [107]:
SampleInfo$Cohort<-substr(SampleInfo$Sample,1,2)
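
The two derived variables can be illustrated on a toy data frame. The sample IDs and scores below are hypothetical, and treating a missing score as `WGD = 0` is an assumption of this sketch; the subsetting assignment used above would typically stop with an error if `Facets_WGD_score` contained `NA`.

```r
# Toy data: hypothetical sample IDs and scores
toy <- data.frame(Sample = c("SHGSOC999", "DO99999", "SHGSOC998"),
                  Facets_WGD_score = c(0.72, 0.10, NA),
                  stringsAsFactors = FALSE)

# Binary flag at the 0.5 threshold; a missing score is set to 0 (assumption)
toy$WGD <- as.integer(!is.na(toy$Facets_WGD_score) & toy$Facets_WGD_score >= 0.5)

# Cohort label from the first two characters of the sample ID
toy$Cohort <- substr(toy$Sample, 1, 2)
print(toy)
```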

Create full and reduced feature sets

In [116]:
SampleInfo_base<-SampleInfo[,c(1,39,2:11,27:28,34,29:30,12,13,24,38,25:26,31:33,35:37)]
SampleInfo_full<-SampleInfo[,c(1,39,2:11,27:28,34,29:30,12,13,24,38,25:26,31:33,35:37,14:23)]

rownames(SampleInfo_base)<-as.character(SampleInfo_base$Sample)
rownames(SampleInfo_full)<-as.character(SampleInfo_full$Sample)

## Output data frames to file

In [117]:
write.table(SampleInfo_base,file="Manuscript/Intermediate_data/SampleInformation.txt",sep="\t",quote=F)
write.table(SampleInfo_full,file="Manuscript/Intermediate_data/SampleInformation_full.txt",sep="\t",quote=F)
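
These two files are written with row names, so the header line has one fewer field than the data rows. `read.table` recognises that layout and uses the first column as row names. A small round-trip sketch on toy data, writing to a temporary file rather than the paths above:

```r
# Toy data frame with row names set from the Sample column, as above
toy <- data.frame(Sample = c("S1", "S2"),
                  Purity = c(0.5, 0.8),
                  stringsAsFactors = FALSE)
rownames(toy) <- toy$Sample

# Write with the same options as above, but to a temporary file
f <- tempfile(fileext = ".txt")
write.table(toy, file = f, sep = "\t", quote = FALSE)

# The shorter header makes read.table take the first column as row names
back <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(back)
unlink(f)
```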

## Format data frame for multivariable modelling

In [None]:
brca<-brca[!(brca$Sample %in% c("SHGSOC060","SHGSOC064","SHGSOC027","SHGSOC005","DO30650")),]

#Columns are bound by position, so brca and SampleInfo_full must hold the same samples in the same row order
lasso_samples<-data.frame(brca[,c("Sample","BRCA1_Germline_SNV","BRCA1_Somatic_SNV","BRCA1_LOF","BRCA2_Germline_SNV","BRCA2_Somatic_SNV","BRCA2_LOF","BRCA1_INV_SPAN","BRCA2_COPY_GAIN","BRCA1status_compound","BRCA2status_compound","Double_del")],
                          SampleInfo_full[,grep("VST",names(SampleInfo_full))],
                          BRCA1_pro_meth=SampleInfo_full$BRCA1_pro_meth,
                          non_BRCA_Somatic_SNV=SampleInfo_full$non_BRCA_HR_Somatic_SNV,
                          non_BRCA_Germline_SNV=SampleInfo_full$non_BRCA_HR_Germline_SNV,
                          Mutational_load=SampleInfo_full$Mutational_load,
                          SV_load=SampleInfo_full$SV_load,
                          CNV_load=SampleInfo_full$CNV_load,
                          WGD=SampleInfo_full$WGD,
                          Cellularity=SampleInfo_full$Purity,
                          HRDetect=SampleInfo_full$HRDetect,
                          Cohort=SampleInfo_full$Cohort)

write.table(lasso_samples,file="Manuscript/Intermediate_data/SampleInformation_withBRCAstatus.txt",sep="\t",row.names=F,quote=F)

lasso_samples_expr<-lasso_samples[!is.na(lasso_samples$BRCA1_VST),setdiff(colnames(lasso_samples),"BRCAstatus")]

rf_dataset<-lasso_samples_expr

genomic_dataset<-lasso_samples[,setdiff(colnames(lasso_samples),
                                        c("BRCA1_VST","BRCA2_VST","BARD1_VST","RAD50_VST","NBN_VST","MRE11_VST" ,
                                          "RAD51B_VST" ,"RAD51_VST","PALB2_VST","RAD51D_VST","RAD51C_VST" , 
                                          "BRIP1_VST", "BRCAstatus","non_BRCA_Somatic_SNV","non_BRCA_Germline_SNV"))]

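The `data.frame()` call that builds `lasso_samples` combines columns from `brca` and `SampleInfo_full` by row position, not by sample ID. One way to guard against a mismatch is to align the second table to the first with `match()` before binding. The sketch below uses toy tables with hypothetical names and values.

```r
# Toy tables listing the same samples in a different order (hypothetical values)
left  <- data.frame(Sample = c("S3", "S1", "S2"),
                    BRCA1_LOF = c(1, 0, 1),
                    stringsAsFactors = FALSE)
right <- data.frame(Sample = c("S1", "S2", "S3"),
                    HRDetect = c(0.1, 0.2, 0.3),
                    stringsAsFactors = FALSE)

# Row of `right` corresponding to each row of `left`
idx <- match(left$Sample, right$Sample)

# Stop early if any sample in `left` is missing from `right`
stopifnot(!anyNA(idx))

# Bind columns only after aligning by sample ID
combined <- data.frame(left, HRDetect = right$HRDetect[idx])
print(combined)
```

If the two tables are already in the same order, `idx` is simply `1:n` and the result is unchanged.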

In [132]:
print(names(lasso_samples_expr))

 [1] "Sample"                "BRCA1_Germline_SNV"    "BRCA1_Somatic_SNV"    
 [4] "BRCA1_LOF"             "BRCA2_Germline_SNV"    "BRCA2_Somatic_SNV"    
 [7] "BRCA2_LOF"             "BRCA1_INV_SPAN"        "BRCA2_COPY_GAIN"      
[10] "BRCA1status_compound"  "BRCA2status_compound"  "Double_del"           
[13] "BRCA1_VST"             "BRCA2_VST"             "BARD1_VST"            
[16] "RAD50_VST"             "NBN_VST"               "MRE11_VST"            
[19] "RAD51B_VST"            "RAD51_VST"             "PALB2_VST"            
[22] "RAD51D_VST"            "RAD51C_VST"            "BRIP1_VST"            
[25] "BRCA1_pro_meth"        "non_BRCA_Somatic_SNV"  "non_BRCA_Germline_SNV"
[28] "Mutational_load"       "SV_load"               "CNV_load"             
[31] "WGD"                   "Cellularity"           "HRDetect"             
[34] "Cohort"               


In [133]:
print(names(genomic_dataset))

 [1] "Sample"               "BRCA1_Germline_SNV"   "BRCA1_Somatic_SNV"   
 [4] "BRCA1_LOF"            "BRCA2_Germline_SNV"   "BRCA2_Somatic_SNV"   
 [7] "BRCA2_LOF"            "BRCA1_INV_SPAN"       "BRCA2_COPY_GAIN"     
[10] "BRCA1status_compound" "BRCA2status_compound" "Double_del"          
[13] "BRCA1_pro_meth"       "Mutational_load"      "SV_load"             
[16] "CNV_load"             "WGD"                  "Cellularity"         
[19] "HRDetect"             "Cohort"              


## Output datasets for modelling

In [134]:
write.table(lasso_samples_expr,file="Manuscript/Intermediate_data/Dataset_for_lasso_regression.txt",sep="\t",row.names=F,quote=F)
write.table(rf_dataset,file="Manuscript/Intermediate_data/Dataset_for_randomforest.txt",sep="\t",row.names=F,quote=F)
write.table(genomic_dataset,file="Manuscript/Intermediate_data/Dataset_for_elastic_regression_genomic.txt",sep="\t",row.names=F,quote=F)
